In my previous post you can find details of my session at Oracle OpenWorld 2012, where I will be talking about Big Data and Hadoop. Here I want to take a closer look at a specific query that runs against a relational database and rewrite it as a MapReduce job to run under Hadoop, as I cover in that session. I want to do this to give people who already work with databases a familiar example with which to contrast the two technologies. Before doing so, note that Hive can do this translation from a SQL-like language for us; I will look at Hive later and instead start here from the ground up.
I will not be covering an introduction to Big Data, which you can find here (add link), or the installation of Hadoop, for which there are plenty of guides online. However, it is useful to know that for testing I installed the Cloudera CDH4 distribution of Hadoop in pseudo-distributed mode on Oracle Linux 5.6, on a 2-socket Intel Xeon E5-2680 server with an Intel SSD 910 for storage. For a clustered environment I also ran the same workload on an Oracle Big Data Appliance, which I cover in my OOW session.
Firstly, to run the type of test that I want to do, you need a query to port and a dataset. Fortunately the TPC-H workload gives us everything we need, so I will take a look at TPC-H Query 1, the pricing summary report query, which summarizes pricing for line items shipped as of a given date. I haven't chosen this query at random: it has the advantage of being very simple to work with for the purposes of an example. In particular, the query accesses just one table, the lineitem table, and requires only one full table scan, or pass through the data, to satisfy it, so there are no joins or multiple MapReduce jobs needed to replicate the same workload in Hadoop.
In SQL, Query 1 (in Oracle Database format) looks as follows:
select l_returnflag, l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '112' day (3)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
Note that the date in the where clause is a substitution parameter that can change between query runs, but in this example it evaluates to l_shipdate <= 11-AUG-98. The easiest way to check this is from within the Oracle database, for example:
SQL> select date '1998-12-01' - interval '112' day (3) from dual;
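which returns:
11-AUG-98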
The statement consists of a number of aggregate functions (namely SUM(), AVG() and COUNT()) grouped and ordered by l_returnflag and l_linestatus, so these will need to be implemented in Hadoop.
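As a rough preview of how this maps onto MapReduce (the full job is the subject of the next post), the GROUP BY can be handled by the framework itself: if the map phase keys each record on the concatenation of l_returnflag and l_linestatus, the shuffle delivers all matching records to a single reduce() call, which accumulates the aggregates. The sketch below is a minimal illustration under my own naming and value-format assumptions (the mapper is assumed to pass quantity|extendedprice|discount|tax), not the definitive implementation:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hadoop's sort/shuffle groups records by key, so each call to reduce()
// receives every line item for one (l_returnflag, l_linestatus) pair.
public class Q1Reducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        double sumQty = 0, sumBase = 0, sumDisc = 0, sumCharge = 0, sumDiscount = 0;
        for (Text v : values) {
            // Assumed value layout: quantity|extendedprice|discount|tax
            String[] f = v.toString().split("\\|");
            double qty = Double.parseDouble(f[0]);
            double price = Double.parseDouble(f[1]);
            double disc = Double.parseDouble(f[2]);
            double tax = Double.parseDouble(f[3]);
            sumQty += qty;                               // sum(l_quantity)
            sumBase += price;                            // sum(l_extendedprice)
            sumDisc += price * (1 - disc);               // sum_disc_price
            sumCharge += price * (1 - disc) * (1 + tax); // sum_charge
            sumDiscount += disc;
            count++;                                     // count(*)
        }
        // AVG() needs no special handling: it is just SUM() divided by COUNT().
        // With a single reducer the sorted keys also satisfy the ORDER BY.
        context.write(key, new Text(sumQty + "|" + sumBase + "|" + sumDisc + "|"
                + sumCharge + "|" + (sumQty / count) + "|" + (sumBase / count) + "|"
                + (sumDiscount / count) + "|" + count));
    }
}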
As there is only one table queried (lineitem), only the data for this table is required. To create it, download and compile the dbgen utility from the TPC and generate the lineitem.tbl data file as follows (-T L restricts generation to the lineitem table):
[oracle@sandep1 ~]$ ./dbgen -s 100 -T L
I have chosen a Scale Factor 100 schema resulting in a lineitem file of almost 80GB containing approximately 600 million rows.
[oracle@sandep1 input]$ ls -l lineitem.tbl
-rw-r--r-- 1 hadoop hadoop 79579694556 Jul 13 11:26 lineitem.tbl
The first five lines of this file look as follows:
1|15518935|768951|1|17|33203.72|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|6730908|730909|2|36|69788.52|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
1|6369978|369979|3|8|16381.28|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|213150|463151|4|28|29767.92|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|2402664|152671|5|24|37596.96|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
and if you were to load this data into an Oracle database the fields in the lineitem.tbl file would correspond as follows to the lineitem table (you need to know this in order to know which fields to extract when writing a job for Hadoop).
CREATE TABLE lineitem (L_ORDERKEY NUMBER(10),
L_PARTKEY NUMBER(10),
L_SUPPKEY NUMBER(10),
L_LINENUMBER NUMBER(10),
L_QUANTITY NUMBER,
L_EXTENDEDPRICE NUMBER,
L_DISCOUNT NUMBER,
L_TAX NUMBER,
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT VARCHAR2(25),
L_SHIPMODE VARCHAR2(10),
L_COMMENT VARCHAR2(44));
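With those field positions in hand, the map side of the job is little more than a split and a filter. Here is a matching minimal sketch, with the same caveats as the reducer sketch above (class and value-format choices are my own assumptions): it splits each pipe-delimited line, applies the l_shipdate <= 11-AUG-98 predicate, and emits the grouping key.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Q1Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Field positions (0-based) follow the lineitem.tbl layout shown above
        String[] f = line.toString().split("\\|");
        String shipDate = f[10];                   // l_shipdate, e.g. 1996-03-13
        // ISO-format dates compare correctly as strings, giving us the where clause
        if (shipDate.compareTo("1998-08-11") <= 0) {
            String key = f[8] + f[9];              // l_returnflag + l_linestatus
            // Forward only the columns the aggregates need
            String value = f[4] + "|" + f[5] + "|" + f[6] + "|" + f[7];
            context.write(new Text(key), new Text(value));
        }
    }
}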
However, as Hadoop does not require a schema to be defined up front, you can go ahead and load the data straight into the Hadoop Distributed File System (HDFS). The following example specifies a 256MB block size.
/hadoop/hadoop-2.0.0-cdh4.0.1/bin/hadoop dfs -Ddfs.block.size=268435456 -put /hadoop2/input/lineitem.tbl /user/input/tpch/lineitem256.tbl
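As a quick sanity check on what that block size means for parallelism: 79,579,694,556 bytes at 268,435,456 bytes per block works out to roughly 297 HDFS blocks, and since MapReduce by default creates one input split, and therefore one map task, per block, the query will run as roughly 300 map tasks.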
Now that we have both the sample data loaded into Hadoop and an example query, in the next post I will look at rewriting this same query as a MapReduce job.