API Latest Releases: Java Core, C++ Core, Python, Memory, Pig, Hive,

Quantiles StreamA Study

The goal of this article is to compare the accuracy performance of the DataSketches Quantiles Sketch to an exact, brute-force computation using actual data extracted from one of our back-end servers.

Please get familiar with the Tutorial for quantiles.

Versions

The Paper

The implementation of this Quantiles Sketch was originally inspired by Mergable Summaries, PODS, May, 2012 paper by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.

The Input Data

The data file used for this evaluation, streamA.txt, is real data extracted from one of our back-end servers. It represents one hour of web-site time-spent data measured in milliseconds. The data in this file has a smooth and well-behaved value distribution with a wide dynamic range. It is a text file and consists of consecutive strings of numeric values separated by a line-feeds. Its size is about 2GB.

Creating the Exact CDF and Histograms For Comparison

In order to measure the accuracy of the Approximate Histogram, we need an exact, brute-force analysis of streamA.txt.

The process for creating these comparison standards can be found here.

The Results

The DataSketches Quantiles Sketch provides a very straightforward tradeoff between sketch size and accuracy as the next two examples show.

K = 256, Size = 21408 bytes, a priori Accuracy = +/- 0.717%

The CDF plot:

DataSketches Quantiles Sketch StreamA CDF of ranks to quantiles

The green dots in the above plot represents the Exact cumulative distribution (CDF) of ranks to quantile values. The red circles represent the values returned from the DS Quantiles Sketch getQuantiles(double[]) function.

The plot reveals a very tight fit to the exact quantiles over the full range. The thin black and blue lines just to the left and right of the plotted points represent the lower-bound and upper-bound error derived from the sketch’s getLowerBound() and getUpperBound() functions.

The Histogram plot:

The next plot is the Histogram generated from the values returned by the getPMF(double[]) function. Each of the returned values is multiplied by getN() to produce the respecive mass of each bin. The Green bars represent the Exact Distribution, and the Orange bars represent the results obtained from the DS Quantiles sketch. You can see that they match pretty well.

DataSketches Quantiles Histogrm vs Exact

K = 32, Size = 3232 bytes, a priori Accuracy = +/- 5.4%

The CDF plot:

Note the wider interval between the upper and lower bounds curves:

DataSketches Quantiles Sketch StreamA CDF of ranks to quantiles

The Histogram plot:

DataSketches Quantiles Histogrm vs Exact

A little noiser but still tracks the exact shape pretty well:

The Evaluation Source

The following are used to create the plots above.

Run the above profilers as a java application and supply the config file as the single argument. The program will check if the data file has been unzipped, and if not it will unzip it for you.