Mailing list users (luciddb-users) can "claim" their threads in the archives by registering with the same account they used to post to the email list. NOTE: List threads are updated once every 24hrs.
I've been using Luciddb for a short while now and its been a very positive experience so farand I will be a) hoping that its adoption grows and b) watching with interest as its evolves. However there is one are that puzzles me and that is the performance of aggregations. My understanding and experience to date (Vertica, infobright and SADAS) of Column Orientated Databases is that they are designed to optimize aggregating as they are able to pre aggregate values and do not have to move as much data for computation for this there is a penalty for row look up type queries and updates. However in Luciddb it seems that row look ups perform very fast, but aggregations not as well. For example:
VMWARE ESX VM server with Dedicate SAN, 2 CPU's and 6GB of RAM Header table with 100K of records Detail table with 330 Million Records Hard disk space used in Luciddb 20GB Remote client is Squirrel
select count(1) from fpds."all_build_detail_p" ;
330205685
Query 1 of 1, Rows read: 1, Elapsed time (seconds) - Total: 34.587, SQL query: 0.222, Building output: 34.365
select count(1), sum(evtDuration)/1000, sum(evtSubTypeDuration)/1000 from fpds."all_build_detail_p" where evttype in ('ExecutionEvent');
13861793 14,281,035.285 70,463,672.29
Query 1 of 3, Rows read: 1, Elapsed time (seconds) - Total: 92.519, SQL query: 0.027, Building output: 92.492
select count(1), sum(evtDuration)/1000, sum(evtSubTypeDuration)/1000 from fpds."all_build_detail_p" where evtsubtype in ('MojoSucceeded') and evttype in ('ExecutionEvent');
5191339 12,283,058.055 20,734,403.311
Query 2 of 3, Rows read: 1, Elapsed time (seconds) - Total: 34.0uild4, SQL query: 0.022, Building output: 34.018
select count(1), sum(evtDuration)/1000, sum(evtSubTypeDuration)/1000 from fpds."all_build_summary_p" t1 INNER JOIN fpds."all_build_detail_p" t2 ON t1.KEY0 = t2.KEY0 INNER JOIN TABLE( APPLIB.TIME_DIMENSION( 2012, 1, 1, 2014, 1, 1, 1) ) t3 ON t3.TIME_KEY = APPLIB.CHAR_TO_DATE ('yyyy-M-d',startdate) where builduser in ('buildadm') and (workingDirectory like '%qabuild%') and modulename not in ('fixJar','fixEAR')
39619134 5,339,048.644 18,429,793.865
Query 1 of 1, Rows read: 1, Elapsed time (seconds) - Total: 300.853, SQL query: 0.672, Building output: 300.181
Columns are indexed. All data seems to in memeory as the CPU sits at 98% to 100% on one CPU.
Now fair enough may be that's just a big lump of data for my VM, but then I run
select MONTH_NAME, WEEK_NUMBER_IN_YEAR, FISCAL_WEEK_START_DATE, hostname, builduser, workingdirectory, branch, modulename, evtsubtype, buildtype, buildstatus, evtprojectID, evtmojoID, evtDuration, evtDuration/1000/60 As evtMinutes, evtSubTypeDuration, evtSequence, evtRawTimestamp, evtTimestamp, t1.profilesDocId from fpds."all_build_summary_p" t1 INNER JOIN fpds."all_build_detail_p" t2 ON t1.KEY0 = t2.KEY0 INNER JOIN TABLE( APPLIB.TIME_DIMENSION( 2012, 1, 1, 2014, 1, 1, 1) ) t3 ON t3.TIME_KEY = APPLIB.CHAR_TO_DATE ('yyyy-M-d',startdate) where builduser in ('buildadm') and (workingDirectory like '%qabuild%') and modulename not in ('fixJar','fixEAR') and t1.profilesDocId in ('07f04417-20fe-4f66-a902-0ec1dc393633') and evtSequence between 21279 -5 and 21279 +5 order by t1.profilesDocId, evtSequence
Query 1 of 1, Rows read: 0, Elapsed time (seconds) - Total: 0.318, SQL query: 0.083, Building output: 0.235
So a bit confused that an Analytics Engine built from the ground up seems to struggle with aggregations vs row look ups or am i missing something or doing something wrong?
What's you're seeing is the benefits (and drawbacks) of LucidDB's choice to keep column store segments as the core storage but also include a traditional concept of indexes. The reason you're seeing the highly selective queries (with specific where clauses) perform faster is that LucidDB is reducing the column store segments that could potentially have the rows to a bare minimum. ie, reduce to only possibles then scan/query on actual column store segments. In general, LucidDB does better than most column stores in selective queries. Notice the perf infobright vs LucidDB in these charts: http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html Selectivity is key in the difference between LucidDB and other column stores.
So, assuming you're sold on why LucidDB performs so well on the queries that are selective because we've actually kept things like IDXes around for good measure, on to your question about performance on larger datasets.
The other column stores are much better on "late materialization." LucidDB, while it keeps the rows compressed, tiny, and split into different columns when it comes query time we reverse all that beauty to return it to an original row and THEN do the SQL on it. In other words, we take a 30MB column and turn it into 300million rows with the original values (in memory perhaps) and then do the SQL on those "exploded" data structures. This is what makes Farrago/Fennel a general purpose pluggable framework. Other systems will keep the data in it's compressed format and even have operators that work on the compressed structures instead of "blowing it up" into rows.
So, it's possible you're doing ZERO I/O (small columns w/ 300 million rows) but LucidDB is reading those small segments, turning them (in memory and streaming though) into millions of rows to iterate over (and perform SQL on) which is CPU/memory intensive. Hence why you're seeing light I/O and heavy CPU.
Late materialization is hard; Stonebraker has some talks on this. LucidDB as an open source project started with fully functional column *storage* first but we've never gotten contributions/work on building operators that do more SQL on the compressed segments. As such, LucidDB remains a great implementation of column storage, a very flexible and robust SQL implementation, but is not the fastest DB in the race.
You can look to Firewater to spread the load over more CPUs (experimental but works) so that you can get some MPP in there to reduce the overall query latency. Other than that, improving LucidDB speed on the larger dataset queries you posted above requires quite a bit of engineering work (man years).
Hope this explanation was helpful, if not encouraging (re: our speed).