LucidDbAggregateDesigner
Pentaho Aggregate Designer is a project for automatically defining and populating aggregate tables for use in accelerating Mondrian query performance. This page documents a LucidDB add-on (available for the 0.7.4 release) which integrates the Pentaho Aggregate Designer into LucidDB as a server-side procedure.
Contents |
Overview
For background on the Pentaho Aggregate Designer, see its user guide.
This diagram illustrates how the aggregate designer can be used with LucidDB, with the numbered steps explained below:
- [1] The first step, as always, is to define a warehouse schema and ETL for extracting source data and loading it into warehouse tables. At this point, a Mondrian schema can be defined, without any aggregations.
- [2] The aggregate designer UI can be run as a LucidDB client; however, the add-on allows the aggregate designer to be installed into LucidDB as a server-side system procedure, and then invoked via SQL with parameters telling it where to find the Mondrian schema definition, which algorithm to use, etc. When the add-on procedure is invoked, the aggregate designer executes its algorithm, and as each aggregation gets selected, a callback executes DDL to create the corresponding LucidDB table definition, and then executes an INSERT statement to populate it (with warehouse tables as the source of the INSERT and aggregate tables as the target). The aggregate designer also emits an augmented version of the Mondrian schema, with the selected aggregate specifications included. The add-on takes care of post-processing such as ANALYZE for the aggregate tables as well.
- [3] Mondrian can then be used with the augmented schema, with queries automatically accelerated by aggregate tables wherever possible.
Prerequisites
Before continuing, carry out the procedure in LucidDbMondrianReplication to get LucidDB set up with a copy of the FOODMART database. Be sure to follow the steps for creating a separate FOODMART user, since we'll need that in order for the aggregate designer to be able to pick up the right schema automatically.
Verify your setup by checking the loaded table row counts:
0: jdbc:luciddb:rmi://localhost> select table_name, current_row_count from sys_root.dba_stored_tables; +----------------------+--------------------+ | TABLE_NAME | CURRENT_ROW_COUNT | +----------------------+--------------------+ | store | 25 | | time_by_day | 730 | | product | 1560 | | product_class | 110 | | warehouse | 24 | | sales_fact_1997 | 86837 | | promotion | 1864 | | customer | 10281 | | inventory_fact_1997 | 4070 | | salary | 21252 | | employee | 1155 | | position | 18 | | department | 12 | | employee_closure | 7179 | | store_ragged | 25 | +----------------------+--------------------+
Install the Client
Now, let's walk through using Pentaho Aggregate Designer UI as a client against LucidDB.
- Download the aggregate designer package here and unpack it.
- Copy luciddb/plugin/LucidDbClient.jar to the aggregation-designer/drivers directory.
- Launch the aggregate designer UI (startaggregationdesigner.sh/.bat)
- Fill in the database connection info as follows:
- Connection Type: Generic database
- Access: Native (JDBC)
- Custom Connection URL: jdbc:luciddb:rmi://localhost
- Custom Driver Class Name: com.lucidera.jdbc.LucidDbRmiDriver
- User Name: FOODMART
- Test the connection
- Provide the location for FoodMart.xml
- Click Apply, choose the Sales cube, and connect
- You can ignore the validation error regarding primary keys
- Click the Advisor button, fill in 5 for max aggregates and 10 seconds for max runtime, and then click Recommend
- You should end up with a nice display like the one below:
At this point, you are supposed to be able to create the selected aggregates automatically by clicking the Export button and then the various Execute buttons. However, currently this won't work with LucidDB, because the aggregate designer attempts to start a transaction, and LucidDB does not support user-level transactions. You could export each script individually and then execute them via sqlline. (If you try this, see #Implementation Notes for some dialect search/replace needed to avoid errors.) Or...you could use the add-on to let LucidDB do everything in one procedure call, as described in the following sections.
Install The Add-on
Eventually, we would like to build the add-on to be a pure plugin for LucidDB, requiring no changes to the LucidDB installation. A few "jar hell" issues are currently hindering this, so for now, it's necessary for some of the code involved to be loaded off of the system classpath instead of via dynamic classloading. Shut down LucidDB and modify luciddb/bin/classpath.gen (or farrago/classpath.gen for a LucidDB developer build), adding the following entries (without line breaks) and changing the paths as appropriate to point to your Pentaho aggregation-designer installation:
:/aggregation-designer/lib/log4j-1.2.9.jar:/aggregation-designer/lib/mondrian-3.0.4.jar:/aggregation-designer/lib/commons-vfs-1.0.jar :/aggregation-designer/lib/commons-math-1.0.jar:/aggregation-designer/lib/javacup-10k.jar
Restart the LucidDB server; you'll see some warning noise from VJDBC complaining about log4j, but you can ignore this.
Download the luciddb-aggdes add-on and unpack it.
Next, from a sqlline client connected as sa, execute the following after adjusting the jar path:
create or replace jar applib.aggdes_jar
library 'file:/path/to/luciddb-aggdes-0.7.4/luciddb-aggdes.jar'
options(0);
create or replace procedure applib.create_mondrian_aggregates(
schema_file varchar(1024),
cube_name varchar(128),
algorithm_class varchar(128),
time_limit_seconds int,
aggregate_limit int,
output_schema_file varchar(128))
language java
modifies sql data
external name 'applib.aggdes_jar:net.sf.luciddb.aggdes.CreateMondrianAggregatesUdp.execute';
grant execute
on specific procedure applib.create_mondrian_aggregates
to foodmart;
Call The Procedure
Disconnect from sqllineClient and reconnect as user FOODMART, then invoke the procedure as follows, again adjusting paths as appropriate:
call applib.create_mondrian_aggregates(
'/path/to/open/mondrian/demo/FoodMart.xml',
'Sales',
'org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm',
300,
10,
'/path/to/output/FoodMartAgg.xml');
To see what is happening, check LucidDB's trace directory. You should see activity such as the following:
...
INFO: -- Aggregate table agg_sttpsssssss
-- Estimated 480 rows, 23040 bytes
CREATE TABLE "agg_sttpsssssss" (
"store_Store_Sqft" INTEGER,
"time_by_day_Year" SMALLINT,
"time_by_day_Quarter" VARCHAR(30),
"product_class_Product_Family" VARCHAR(30),
"sales_fact_1997_Unit_Sales" DOUBLE,
"sales_fact_1997_Store_Cost" DOUBLE,
"sales_fact_1997_Store_Sales" DOUBLE,
"sales_fact_1997_Sales_Count" INTEGER,
"sales_fact_1997_Customer_Count" INTEGER,
"sales_fact_1997_Promotion_Sale" DOUBLE,
"sales_fact_1997_fact_count" INTEGER)
...
INFO: -- Populate aggregate table agg_sttpsssssss
INSERT INTO "agg_sttpsssssss" (
"store_Store_Sqft",
"time_by_day_Year",
"time_by_day_Quarter",
"product_class_Product_Family",
"sales_fact_1997_Unit_Sales",
"sales_fact_1997_Store_Cost",
"sales_fact_1997_Store_Sales",
"sales_fact_1997_Sales_Count",
"sales_fact_1997_Customer_Count",
"sales_fact_1997_Promotion_Sale",
"sales_fact_1997_fact_count")
select
"store"."store_sqft" as "store_Store_Sqft",
"time_by_day"."the_year" as "time_by_day_Year",
"time_by_day"."quarter" as "time_by_day_Quarter",
"product_class"."product_family" as "product_class_Product_Family",
sum("sales_fact_1997"."unit_sales") as "sales_fact_1997_Unit_Sales",
sum("sales_fact_1997"."store_cost") as "sales_fact_1997_Store_Cost",
sum("sales_fact_1997"."store_sales") as "sales_fact_1997_Store_Sales",
count("sales_fact_1997"."product_id") as "sales_fact_1997_Sales_Count",
count(distinct "sales_fact_1997"."customer_id") as "sales_fact_1997_Customer_Count",
sum((case when "sales_fact_1997"."promotion_id" = 0 then 0 else "sales_fact_1997"."store_sales" end)) as "sales_fact_1997_Promotion_Sale",
count(*) as "sales_fact_1997_fact_count"
from
"sales_fact_1997" as "sales_fact_1997",
"store" as "store",
"time_by_day" as "time_by_day",
"product_class" as "product_class",
"product" as "product"
where
"sales_fact_1997"."store_id" = "store"."store_id" and
"sales_fact_1997"."time_id" = "time_by_day"."time_id" and
"sales_fact_1997"."product_id" = "product"."product_id" and
"product"."product_class_id" = "product_class"."product_class_id"
group by
"store"."store_sqft",
"time_by_day"."the_year",
"time_by_day"."quarter",
"product_class"."product_family"
...
INFO: ANALYZE TABLE "agg_sttpsssssss" ESTIMATE STATISTICS FOR ALL COLUMNS
...
You'll also see some chatter in the lucidDbServer console, since the aggdesigner command-line interface writes to System.out.
Once the procedure completes, verify that it actually created new tables:
0: jdbc:luciddb:rmi://localhost> select table_name, current_row_count from sys_root.dba_stored_tables; +----------------------+--------------------+ | TABLE_NAME | CURRENT_ROW_COUNT | +----------------------+--------------------+ | store | 25 | | time_by_day | 730 | | product | 1560 | | product_class | 110 | | warehouse | 24 | | sales_fact_1997 | 86837 | | promotion | 1864 | | customer | 10281 | | inventory_fact_1997 | 4070 | | salary | 21252 | | employee | 1155 | | position | 18 | | department | 12 | | employee_closure | 7179 | | store_ragged | 25 | | agg_sttpsssssss | 132 | | agg_sssccsssssss | 205 | | agg_sssttcsssssss | 180 | | agg_stccsssssss | 12 | | agg_tttsssssss | 12 | | agg_ttpcsssssss | 110 | | agg_ssscsssssss | 12 | | agg_pccsssssss | 28 | | agg_ttpcsssssss0 | 12 | | agg_pccsssssss0 | 6 | +----------------------+--------------------+
Test The Aggregates
Now to verify that Mondrian can actually use the generated aggregates.
Assuming you've already pointed Mondrian at LucidDB, first make sure you have disabled aggregate table usage in the Mondrian .properties file (this is the default):
mondrian.rolap.aggregates.Use=false mondrian.rolap.aggregates.Read=false
Run this MDX query:
select {[Measures].[Unit Sales]} on columns,
{[Product].[All Products]} on rows
from [Sales];
Check LucidDB's trace file, and it should be hitting the fact table:
INFO: select "time_by_day"."the_year" as "c0", sum("sales_fact_1997"."unit_sales") as "m0"
from "time_by_day" as "time_by_day", "sales_fact_1997" as "sales_fact_1997"
where "sales_fact_1997"."time_id" = "time_by_day"."time_id" and "time_by_day"."the_year" = 1997 group by "time_by_day"."the_year"
Change Mondrian properties to enable agg table usage:
mondrian.rolap.aggregates.Use=true mondrian.rolap.aggregates.Read=true
And change the Mondrian setup to use the generated FoodMartAgg.xml (instead of the original FoodMart.xml).
Run the MDX query above again:
Mondrian Error:Too many errors, '1', while loading/reloading aggregates. ? 42792 [main] ERROR mondrian.rolap.aggmatcher.AggTableManager - ExplicitRecognizer.matchLevel:Recognizer.makeLevel: Double Match for candidate aggregate table 'agg_sssccsssssss' for fact table 'sales_fact_1997' and column 'store_Store_Country' matched two hierarchies: 1) table='store_ragged', column='store_Store_Country' and 2) table='store', column='store_Store_Country' Mondrian Error:Too many errors, '1', while loading/reloading aggregates.
Oops, what happened? Well, there's a second cube [Sales Ragged] in Foodmart, and its presence is messing things up. So in FoodMartAgg.xml, delete its definition and then try again.
Now the MDX query should be hitting an agg table:
INFO: select "agg_tttsssssss"."time_by_day_Year" as "c0", sum("agg_tttsssssss"."sales_fact_1997_Unit_Sales") as "m0"
from "agg_tttsssssss" as "agg_tttsssssss"
where "agg_tttsssssss"."time_by_day_Year" = 1997 group by "agg_tttsssssss"."time_by_day_Year"
Success!
Implementation Notes
The source code for the add-on is available in perforce.eigenbase.org:
- CreateMondrianAggregatesUdp.java: entry point which converts SQL arguments into an argument array understood by the aggdesigner's command-line interface, and then calls that interface (org.pentaho.aggdes.Main.main), all within the LucidDB procedure invocation
- LucidDbAggResultHandler.java: implements the aggdesigner's ResultHandler interface by executing the generated DDL and DML commands against LucidDB in order to immediately create and populate the recommended aggregate tables
The procedure executes all DDL and DML as the same user who invoked it. For everything to work, that user must have been created with the correct default schema (FOODMART in this example). See LucidDbCreateUser for syntax.
Currently, the aggregate designer produces invalid SQL:2003 DDL for some datatypes. To work around this, the add-on performs the following post-processing on the DDL before executing it in LucidDB:
sql = sql.replaceAll("INTEGER\\(.*\\)", "INTEGER");
sql = sql.replaceAll("BIGINT\\(.*\\)", "BIGINT");
sql = sql.replaceAll("SMALLINT\\(.*\\)", "SMALLINT");
You'll need to do the same if you use the aggregate designer UI to generate scripts.
For a future release, the real fix for this needs to be applied to the org.pentaho.aggdes.model.mondrian.MondrianDialect.supportsPrecision method in the Pentaho codebase as follows:
--- src/org/pentaho/aggdes/model/mondrian/MondrianDialect.java (revision 33317)
+++ src/org/pentaho/aggdes/model/mondrian/MondrianDialect.java (working copy)
@@ -91,6 +91,6 @@
// hsqldb v1.8 returns 'HSQL Database Engine'
return
(meta.getDatabaseProductName().toUpperCase().indexOf("HSQL") < 0) &&
- (!dialect.isPostgres() || (type.toUpperCase().indexOf("CHAR") >= 0));
+ ((!dialect.isPostgres() && !dialect.isLucidDB()) || (type.toUpperCase().indexOf("CHAR") >= 0));
}
}
Improvements Needed
Here are some features which will probably be needed for heavy-duty usage patterns:
- generate index and constraint definitions for aggregate tables
- generate the population script (with pretruncation) as a separate procedure which can be called at any time after the warehouse has been refreshed
- override the default StatisticsProvider implementation to retrieve rowcount, cardinality and histogram information from LucidDB's catalog (perhaps plugging in some of the LucidDB optimizer's cost functions as well)
- similarly, override the default StatisticsProvider's getSpace and getLoadTime estimates to match LucidDB, in particular, taking the benefits of column store compression into account
- improved packaging for auto-installation of the add-on

