Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Sign In Apply for Membership

Mailing list users (luciddb-users) can "claim" their threads in the archives by registering with the same account they used to post to the email list. NOTE: List threads are updated once every 24hrs.
Performance Scalability Question Observation
  • kimberlad June 2012
    I've been using Luciddb for a short while now and its been a very positive experience so farand I will be a) hoping that its adoption grows and b) watching with interest as its evolves. However there is one are that puzzles me and that is the performance of aggregations. My understanding and experience to date (Vertica, infobright  and  SADAS)  of Column Orientated Databases is that they are designed to optimize aggregating as they are able to pre aggregate values and do not have to move as much data for computation for this there is a penalty for row look up type queries and updates. However in Luciddb it seems that row look ups perform  very fast, but aggregations not as well.  For example:

    VMWARE ESX VM server with Dedicate SAN, 2 CPU's and 6GB of RAM
    Header table with 100K of records
    Detail table with 330 Million Records
    Hard disk space used in Luciddb 20GB
    Remote client is Squirrel


    select count(1)
    from  fpds."all_build_detail_p" ;

    330205685

    Query 1 of 1, Rows read: 1, Elapsed time (seconds) - Total: 34.587, SQL query: 0.222, Building output: 34.365


    select count(1), sum(evtDuration)/1000, sum(evtSubTypeDuration)/1000
    from  fpds."all_build_detail_p"
    where evttype in ('ExecutionEvent');

    13861793    14,281,035.285    70,463,672.29

    Query 1 of 3, Rows read: 1, Elapsed time (seconds) - Total: 92.519, SQL query: 0.027, Building output: 92.492

    select count(1), sum(evtDuration)/1000, sum(evtSubTypeDuration)/1000
    from  fpds."all_build_detail_p"
    where evtsubtype in ('MojoSucceeded')
    and evttype in ('ExecutionEvent');

    5191339    12,283,058.055    20,734,403.311

    Query 2 of 3, Rows read: 1, Elapsed time (seconds) - Total: 34.0uild4, SQL query: 0.022, Building output: 34.018

    select count(1), sum(evtDuration)/1000, sum(evtSubTypeDuration)/1000
    from fpds."all_build_summary_p" t1
    INNER JOIN fpds."all_build_detail_p" t2 ON t1.KEY0 = t2.KEY0
    INNER JOIN TABLE( APPLIB.TIME_DIMENSION(
      2012, 1, 1, 2014, 1, 1, 1) ) t3 ON t3.TIME_KEY = APPLIB.CHAR_TO_DATE ('yyyy-M-d',startdate)
    where builduser in ('buildadm')
    and (workingDirectory  like '%qabuild%')
    and modulename not in ('fixJar','fixEAR')

    39619134    5,339,048.644    18,429,793.865

    Query 1 of 1, Rows read: 1, Elapsed time (seconds) - Total: 300.853, SQL query: 0.672, Building output: 300.181

    Columns are indexed. All data seems to in memeory as the CPU sits at 98% to 100% on one CPU.

    Now fair enough may be that's just a big lump of data for my VM, but then I run

    select MONTH_NAME, WEEK_NUMBER_IN_YEAR, FISCAL_WEEK_START_DATE, hostname, builduser, workingdirectory, branch, modulename, evtsubtype, buildtype, buildstatus, evtprojectID, evtmojoID, evtDuration, evtDuration/1000/60 As evtMinutes, evtSubTypeDuration, evtSequence, evtRawTimestamp, evtTimestamp, t1.profilesDocId
    from fpds."all_build_summary_p" t1
    INNER JOIN fpds."all_build_detail_p" t2 ON t1.KEY0 = t2.KEY0
    INNER JOIN TABLE( APPLIB.TIME_DIMENSION(
      2012, 1, 1, 2014, 1, 1, 1) ) t3 ON t3.TIME_KEY = APPLIB.CHAR_TO_DATE ('yyyy-M-d',startdate)
    where builduser in ('buildadm')
    and (workingDirectory  like '%qabuild%')
    and modulename not in ('fixJar','fixEAR')
    and  t1.profilesDocId in ('07f04417-20fe-4f66-a902-0ec1dc393633')
    and evtSequence between 21279 -5 and 21279 +5
    order by t1.profilesDocId, evtSequence

    Query 1 of 1, Rows read: 0, Elapsed time (seconds) - Total: 0.318, SQL query: 0.083, Building output: 0.235

    So a bit confused that an Analytics Engine built from the ground up seems to  struggle with aggregations  vs row look ups or am i missing something or doing something wrong?

    Thanks

    Mike

  • nagoodmannagoodman June 2012
    Mike,

    Thanks for putting LucidDB through your tests.

    What's you're seeing is the benefits (and drawbacks) of LucidDB's choice to keep column store segments as the core storage but also include a traditional concept of indexes.  The reason you're seeing the highly selective queries (with specific where clauses) perform faster is that LucidDB is reducing the column store segments that could potentially have the rows to a bare minimum.  ie, reduce to only possibles then scan/query on actual column store segments.  In general, LucidDB does better than most column stores in selective queries.  Notice the perf infobright vs LucidDB in these charts: http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html Selectivity is key in the difference between LucidDB and other column stores.

    So, assuming you're sold on why LucidDB performs so well on the queries that are selective because we've actually kept things like IDXes around for good measure, on to your question about performance on larger datasets.

    The other column stores are much better on "late materialization."  LucidDB, while it keeps the rows compressed, tiny, and split into different columns when it comes query time we reverse all that beauty to return it to an original row and THEN do the SQL on it.  In other words, we take a 30MB column and turn it into 300million rows with the original values (in memory perhaps) and then do the SQL on those "exploded" data structures.  This is what makes Farrago/Fennel a general purpose pluggable framework.  Other systems will keep the data in it's compressed format and even have operators that work on the compressed structures instead of "blowing it up" into rows.

    So, it's possible you're doing ZERO I/O (small columns w/ 300 million rows) but LucidDB is reading those small segments, turning them (in memory and streaming though) into millions of rows to iterate over (and perform SQL on) which is CPU/memory intensive.  Hence why you're seeing light I/O and heavy CPU.

    Late materialization is hard; Stonebraker has some talks on this.  LucidDB as an open source project started with fully functional column *storage* first but we've never gotten contributions/work on building operators that do more SQL on the compressed segments.  As such, LucidDB remains a great implementation of column storage, a very flexible and robust SQL implementation, but is not the fastest DB in the race.

    You can look to Firewater to spread the load over more CPUs (experimental but works) so that you can get some MPP in there to reduce the overall query latency.  Other than that, improving LucidDB speed on the larger dataset queries you posted above requires quite a bit of engineering work (man years).

    Hope this explanation was helpful, if not encouraging (re: our speed).

  • kimberlad June 2012
    Thanks for taking the time to put together this detailed response its much appreciated and makes things clearer.

    Mike
  • nagoodmannagoodman June 2012
    Mike you're welcome!