In the first place, why would you want to put together LucidDB (a single-node column store relational database with no map/reduce execution) with Hive (a SQL layer built on top of Hadoop's map/reduce architecture)?
Well, Hive is great for storing and crunching massive (up to petabyte-scale) amounts of data on large clusters of machines (up to thousands of nodes). However, Hadoop's job scheduling overhead and Hive's lack of indexing can mean that queries which only need to reference small amounts of data may not come back with results for minutes (or even hours on a very busy cluster). This can make it frustrating or impossible to carry out OLAP, exploratory, or iterative analysis over small subsets of data within Hive. Also, Hive's support for the SQL language is still incomplete, making it more difficult to express some queries.
In contrast, LucidDB can comfortably handle amounts of data under a terabyte (with much better performance than traditional row-oriented databases), and supports features such as bitmap indexing and star join optimization, which are important for fast response times on analytic queries. And its SQL support and client drivers are quite a bit more advanced and standards-compliant than Hive's. However, LucidDB cannot yet scale out to a cluster in the way that Hive can, and LucidDB does not yet parallelize execution of individual queries even on a single node. So for "big data" queries (e.g. over tens of terabytes of data), LucidDB by itself can't do the job.
So, for huge data warehouses, it makes sense to introduce a division of labor in which LucidDB servers are used for creating and accessing small "data marts" hanging off of a large Hive cluster. In addition, LucidDB supports the ability to push down SQL queries into foreign database servers such as Hive, so even when the data is all kept in Hive, in some cases it may be possible to use LucidDB as a middleman for translating standard/advanced SQL into the dialect understood by Hive. The fact that both Hive and LucidDB are Java-based makes integration smooth (a single JVM can host the LucidDB server with the Hive client components completely embedded).
The rest of this document explains how to set up the integration, what is currently supported, and the roadmap for tighter integration going forward.
There is a short technical video from Nicholas Goodman that demonstrates this experimental integration.
Here's what you'll need:
- An installation of LucidDB 0.9.3 or later (use the download links on the LucidDB home page).
- A Hive build from latest trunk (follow the instructions in the Hive wiki), with HADOOP_HOME set
- Shut down your LucidDB server if it is still running
- Download Honeybear.2.patch and apply it to your Hive trunk checkout with patch -p0 < Honeybear.2.patch
- Run Hive's ant package command to rebuild Hive
- export LUCIDDB_HOME=/path/to/your/install/of/luciddb-0.9.3
- From Hive trunk, run build/dist/bin/hive --service honeybear
This will start up the Honeybear service, with LucidDB running as a server containing an embedded instance of the Hive client. (Note that while this is running, you should not try to start running the normal LucidDB server script at the same time; it will fail since they are configured to share the same location on disk. Likewise, if you are using Hive's default Derby metastore, you won't be able to start the Hive CLI while the Honeybear service is running.)
You will see a lot more noise than usual on the console, since Hadoop startup redirects log4j output there, allowing you to diagnose problems. LucidDB-specific logging will mostly still go to luciddb-0.9.3/trace/Trace.log.
Now you can use LucidDB's sqllineClient to connect and issue LucidDB SQL commands to Honeybear.
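For example, from $LUCIDDB_HOME/bin, run ./sqllineClient, then try a query against the system catalog to confirm the connection (sys_root.dba_schemas is one of LucidDB's standard system views):

```sql
-- quick sanity check once sqllineClient has connected
select schema_name from sys_root.dba_schemas;
```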
Running both LucidDB and Hive Servers
Alternatively, you may want to run two separate servers so that
- LucidDB is running using its standard scripts
- Hive is running using its standard scripts and available to remote clients
To do this, follow the steps above all the way through rebuilding Hive, then do the following.
Step 1 - Copy the following jars to $LUCIDDB_HOME/lib/hadoop (new directory)
from hadoop:
- hadoop-0.20.0-core.jar
from hive:
- hive-exec-0.7.0.jar
- hive-jdbc-0.7.0.jar
- hive-metastore-0.7.0.jar
- hive-service-0.7.0.jar
- libfb303.jar
- log4j-1.2.15.jar
Step 2 - If you're using an existing LucidDB server, manually edit classpath.gen to add the newly copied jars. Alternatively, simply remove classpath.gen, run rm -rf trace/, and run the "install.sh" script again to regenerate it.
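For example (a sketch only: the jar versions and the Hive/LucidDB paths depend on your builds):

```shell
# Step 1: copy the Hadoop and Hive client jars into a new lib/hadoop directory
mkdir -p $LUCIDDB_HOME/lib/hadoop
cp $HADOOP_HOME/hadoop-0.20.0-core.jar $LUCIDDB_HOME/lib/hadoop/
cd /path/to/hive-trunk/build/dist/lib
cp hive-exec-0.7.0.jar hive-jdbc-0.7.0.jar hive-metastore-0.7.0.jar \
   hive-service-0.7.0.jar libfb303.jar log4j-1.2.15.jar \
   $LUCIDDB_HOME/lib/hadoop/

# Step 2 (fresh classpath variant): wipe and regenerate classpath.gen
cd $LUCIDDB_HOME
rm -f classpath.gen
rm -rf trace/
./install/install.sh    # the location of install.sh may differ in your install
```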
Step 3 - Start the Hive server as you normally would
cd build/dist
./bin/hive --service hiveserver
Step 4 - Start LucidDB as you normally would
cd $LUCIDDB_HOME/bin
./lucidDbServer
The only difference in the subsequent sections is to use the remote URL instead of the embedded URL when creating the foreign server.
Accessing Hive Tables
Next, let's see how to query Hive tables with LucidDB as a middleman. First, follow the instructions in Hive's Getting Started Guide for creating the pokes table and loading data into it. Do this part from the normal Hive CLI (not from sqlline), and verify that you can select * from pokes there and see the loaded data.
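If you don't have the guide handy, the relevant part looks like this (the kv1.txt path is relative to your Hive source checkout, as in the guide):

```sql
-- at the Hive CLI
CREATE TABLE pokes (foo INT, bar STRING);
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
SELECT * FROM pokes;
```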
Now, in sqlline, execute these commands:
create server hive_link foreign data wrapper sys_jdbc
options(
    driver_class 'org.apache.hadoop.hive.jdbc.HiveDriver',
    url 'jdbc:hive://',
    -- use this instead if running Hive with thrift server
    -- url 'jdbc:hive://localhost:10000/default',
    user_name 'SA',
    schema_name 'METASTORE',
    table_types 'TABLE,VIEW',
    skip_type_check 'TRUE');

create schema hive_test;

create foreign table hive_test.pokes
server hive_link
options (table_name 'pokes');
The first command tells LucidDB how to talk to Hive (via Hive's own JDBC driver). The last command copies metadata about table pokes from Hive's metastore into LucidDB's catalog. (Note that the rules for identifiers are different in the two systems: in Hive, the table name is kept lowercase, whereas in LucidDB, it is implicitly uppercased because it is unquoted.)
Now you should be able to do
select * from hive_test.pokes;
select "bar" from hive_test.pokes where "foo" = 200;
It's necessary to quote the column names because they were imported as lowercase in LucidDB's catalog.
If you do
!set maxwidth 1000
select * from sys_root.dba_columns where table_name='POKES';
you will see the imported definitions for the two columns. Note that the VARCHAR column was imported as VARCHAR(1024); that is because the Hive metastore does not put any limits on strings, whereas LucidDB does, so a default of 1024 was chosen.
You can override the column types (but not names) during the import as follows:
create foreign table hive_test.pokes2(
    "foo" int,
    "bar" varchar(128))
server hive_link
options (table_name 'pokes');
Copying Data From Hive to LucidDB
So far, we have just been using LucidDB as a middleman, without actually storing any data in it. Here's how to make a copy of a table from Hive to LucidDB, index one of the columns, and collect stats:
create table hive_test.pokes_copy(
    foo int,
    bar varchar(128));
create index pokes_bar on hive_test.pokes_copy(bar);
insert into hive_test.pokes_copy select * from hive_test.pokes;
analyze table hive_test.pokes_copy estimate statistics for all columns;
For the LucidDB table, we can avoid quoting the column names.
Now when you do something like select * from hive_test.pokes_copy, you will see no activity in the Hive log on the server console, since all of the data is being retrieved directly from LucidDB's storage.
Pushing Down SQL
When you run a query like select "bar" from hive_test.pokes where "foo"=200, what happens? Currently, LucidDB pulls the entire table over from Hive and then performs the filtering and projection locally. You can see that this is happening by either examining the Hive log output, or using EXPLAIN PLAN in LucidDB:
0: jdbc:luciddb:http://localhost> explain plan for select "bar" from hive_test.pokes where "foo"=200;

FennelToIteratorConverter
  FennelReshapeRel(projection=[], filterOp=[COMP_EQ], filterOrdinals=[], filterTuple=[[200 ...
    IteratorToFennelConverter
      ResultSetToFarragoIteratorConverter
        MedJdbcQueryRel(foreignSql=[SELECT * FROM `pokes`])
In some cases, this is a good thing, since when Hive sees a query of the form select * from t, it executes it without running a map/reduce job; instead, it just pulls the data directly from HDFS.
However, if the query is doing a lot of data reduction (e.g. via filtering or aggregation), then we want LucidDB to push down most of the SQL query to Hive.
LucidDB is capable of doing this for a number of SQL constructs, but currently Hive's JDBC driver prevents it from happening. The later section on the roadmap discusses how we will address this. In the meantime, you can still make this happen manually with this procedure:
- Create a view in Hive which expresses the filter condition
- Create a foreign table in LucidDB which maps to the Hive view
- Select from the foreign table in LucidDB
For example, in Hive CLI, do
create view pokes_view as select bar from pokes where foo=200;
Then in sqlline:
create foreign table hive_test.pokes_view(
    "bar" varchar(128))
server hive_link
options (table_name 'pokes_view');

select * from hive_test.pokes_view;
You should see a map/reduce job running on the Hive side for evaluating the view's query.
Understanding SQL Language Differences
There are many differences between Hive and LucidDB SQL support; here are a few of the major ones to be aware of:
- In almost all cases, LucidDB strictly conforms to the ISO/IEC SQL standards; Hive tends to be very lax instead (more like MySQL). For example, LucidDB prevents you from comparing strings and numerics (you instead have to explicitly cast one of them to make them compatible); Hive tries to be friendly by allowing this and auto-casting one to the other (but not necessarily with the results you expect)
- LucidDB supports (and requires) usage of schema qualifiers on tables; Hive does not currently support them (all tables live in one global namespace)
- Unquoted identifiers in LucidDB are implicitly uppercased; in Hive, they are implicitly lowercased
- LucidDB uses the standard double-quote (") for quoting identifiers; Hive uses the MySQL backtick (`) convention instead
- LucidDB uses the standard INSERT INTO T SELECT ... whereas Hive uses its extension INSERT OVERWRITE TABLE T SELECT ... as well as variations for multi-table INSERT
- LucidDB uses FROM T TABLESAMPLE SYSTEM(5.0) where Hive would use FROM T TABLESAMPLE(BUCKET 1 OUT OF 20 ON RAND())
- LucidDB supports fixed-point NUMERIC, DATE, TIME, and TIMESTAMP; Hive does not
- LucidDB does not support the MAP, ARRAY, and STRUCT types supported by Hive
- LucidDB supports almost all forms of joins, and optimizes them automatically; Hive is very restricted on its join types (e.g. equijoins only), and requires the user to manually optimize them via hints such as MAPJOIN and STREAMTABLE
- LucidDB supports both uncorrelated and (with some known bugs) correlated subqueries; the only subqueries Hive supports are those nested in the FROM clause
- LucidDB does not have a concept of partitioned or bucketed tables; in Hive, these are key concepts for managing huge data volumes
- LucidDB uses the standard IN (subquery) where Hive uses the non-standard LEFT SEMI JOIN
- LucidDB does not support Hive's map/reduce extensions such as TRANSFORM, MAP, REDUCE, DISTRIBUTE BY, CLUSTER BY, and SORT BY
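To make one of these differences concrete, here is the same "rows of t that have a match in u" query in each dialect (schema s, tables t and u, and column k are hypothetical names used only for illustration):

```sql
-- LucidDB: standard IN (subquery), double-quoted identifiers, schema-qualified
select * from s.t where "k" in (select "k" from s.u);

-- Hive equivalent: non-standard LEFT SEMI JOIN, one global namespace
-- select t.* from t left semi join u on (t.k = u.k);
```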
Roadmap

Here is the roadmap for tighter integration going forward:

- fix Hive's JDBC driver to allow metadata to be retrieved from a PreparedStatement; this is required for SQL pushdown optimizations to work
- fix some multi-threading issues in Hive
- add a TableOutputFormat to Hive to allow data to be pushed into LucidDB (so that we can use multiple reducers to do seamless end-to-end parallelization for copy of query results from Hive into LucidDB)
- enhance LucidDB's pushdown support to cover all expressions
- add UDAF support to LucidDB, and write bridges to make UDF's, UDAF's, and UDX/UDTF's reusable between Hive and LucidDB
- add a function mapping feature so that builtin and user-defined functions in LucidDB can be correctly mapped to their corresponding functions in Hive during pushdown
- add catalog metadata calls to Hive's JDBC driver so that we can automatically import multiple table definitions at once from Hive's metastore into LucidDB's catalog (this has been fixed with the commit of HIVE-1126)
- pull Firewater and maybe HadoopDB into the picture...