TableSampling

From LucidDB Wiki
Jump to: navigation, search

Syntax

SQL2003 now supports a sampling clause on SELECT; see Section 7.6 of Part 2:

table-factor ::= table-primary [ sample-clause ]
sample-clause ::= TABLESAMPLE sample-method left-paren sample-percentage right-paren [ repeatable-clause ]
sample-method ::= BERNOULLI | SYSTEM
repeatable-clause ::= REPEATABLE left-paren repeat-argument right-paren
sample-percentage ::= numeric-value-expression
repeat-argument ::= numeric-value-expression

Note: We also use an extension of this syntax to implement sample dataset substitution.

Generic Implementation

The TABLESAMPLE keyword can be used with tables, views and subqueries. It is implemented as a logical RelNode called SamplingRel. SamplingRel accepts a single input, and encapsulates the sampling parameters:

  • Mode: BERNOULLI or SYSTEM
  • Rate: The approximate percentage of rows returned in the sample
  • Repeatability: Whether or not the sample should be repeatable
  • Repeatable seed: A seed value to use for producing a repeatable sample.

In Farrago, a sample is only repeatable if the following conditions are met:

  1. The underlying table contents have not changed since the last time the query was issued.
  2. The underlying subquery, if any, produces deterministic output.
  3. TableStatistics have not changed since the last time the query was issued (thereby altering the query plan)
  4. The REPEATABLE keyword was used and the same seed is provided each time the query is issued.

By default, Farrago silently converts requests for SYSTEM sampling to BERNOULLI sampling. SamplingRel is converted to a FennelBernoulliSamplingRel via the FennelBernoulliSamplingRule. FennelBernoulliSamplingRel produces a FemBernoulliSamplingStreamDef which is converted into a fennel::BernoulliSamplingExecStream.

Specific data servers may provide alternate rules to implement SamplingRel. The steps to accomplish this for Fennel-based data servers are roughly:

  1. Implement ExecStream changes and add necessary ExecStreamParam fields. (It may be necessary to write an entirely new ExecStream for sampling if an existing one cannot be adapted.)
  2. Add a new StreamDef to the Farrago metamodel.
  3. Add a new RelNode to represent the new physical implementation of sampling.
  4. Add an optimizier rule to convert SamplingRel with an appropriate child into the new RelNode. For example, a rule might convert a SamplingRel with an FtrsIndexScanRel child into an FtrsSamplingScanRel.
  5. Add unit tests, as necessary.


Future Directions

Future enhancements include

  • An implementation of SYSTEM sampling for FTRS. The primary issue is determining a way to reduce the number of pages read from disk during a table scan. Some possibilities include:
    1. Implement Bernoulli-style sampling at the page, rather than row level. Before reading each page from disk during a full table scan with sampling, flip a coin and read only a subset of the pages.
    2. Implement Vitter's Algorithm Z to choose rows from the table. (Vitter, J. S. Random sampling with a reservoir. In ACM Transactions on Mathematical Software, Vol. 11, No. 1, March 1985.)
  • Once high-performance sampling is available, a set of rules to manipulate the position of SamplingRel would be useful when sampling subqueries. For instance, one might push a SamplingRel that was applied to a view or subquery down towards the unerlying TableAccessRel so that SYSTEM sampling can be used.
  • Some sampling performance may be gained by incorporating the Bernoulli sampling logic from BernoulliSamplingExecStream directly into an FTRS (or other) table access XO. This reduces the the number of rows produced by the table access XO, which can reduce subsequent processing effort.
Personal tools
Product Documentation