Streaming quantiles algorithms, or quantiles sketches, enable us to analyze the distributions
of massive data very quickly using only a small amount of space.
They allow us to compute quantile values given a desired rank, or compute a rank given
a quantile value. Quantile sketches enable us to plot the CDF, PMF or histograms of a distribution.
The goal of this short tutorial it to introduce the reader to some of the basic concepts of quantiles, ranks and their functions.
The actual enumeration can be done in several ways, but for our use here we will define the two common ways that rank can be specified and that we will use.
The natural rank is a natural number from the set of one-based, natural numbers, ℕ1, and is derived by enumerating an ordered set of values, starting with the value 1, up to n, the number of values in the original set.
In our sketch library and documentation, when we refer to rank, we imply normalized rank. However, in this tutorial, we will sometimes use natural ranks to simplify the examples.
Normalized rank is closely associated with the concept of mass. The value associated with the rank 0.5 represents the median value, or the center of mass of the entire set, where half of the values are below the median and half are above. The concept of mass is important to understanding the Probability Mass Function (PMF) offered by all the quantile sketches in the library.
Quantile is the general term that includes other terms that are also quantiles. To wit:
Let’s examine the following table:
Quantile: | 10 | 20 | 30 | 40 | 50 |
---|---|---|---|---|---|
Natural Rank | 1 | 2 | 3 | 4 | 5 |
Normalized Rank | .2 | .4 | .6 | .8 | 1.0 |
The term “value” can be ambiguous because items that we stream into a sketch are values and numeric ranks are also values. To avoid this ambiguity, we will use the term “quantiles” to refer to values that are streamed into a sketch even before they have been associated with a rank.
Let’s define the simple functions
Using an example from the table:
Because of the close, two-way relationship of quantiles and ranks,
r(q) and q(r) form a 1:1 functional pair if, and only if
And this is certainly true of the table above.
With real data we often encounter duplicate quantiles in the stream. Let’s examine this next table.
Quantile: | 10 | 20 | 20 | 20 | 50 |
---|---|---|---|---|---|
Natural Rank | 1 | 2 | 3 | 4 | 5 |
As you can see q(r) is straightforward. But how about r(q)? Which of the ranks 2, 3, or 4 should the function return, given the quantile 20? Given this data, and our definitions so far, the function r(q) is ambiguous. We will see how to resolve this shortly.
By definition, sketching algorithms are approximate, and they achieve their high performance by discarding data. Suppose you feed n quantiles into a sketch that retains only m < n quantiles. This means n-m quantiles were discarded. The sketch must track the quantity n used for computing the rank and quantile functions. When the sketch reconstructs the relationship between ranks and quantiles, n-m quantiles are missing creating holes in the ordered sequence. For example, examine the following tables.
The raw data might look like this, with its associated natural ranks.
Quantile: | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
---|---|---|---|---|---|---|---|---|---|---|
Natural Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
The sketch might discard the even numbered quantiles producing something like this:
Quantile: | 10 | 30 | 50 | 70 | 90 |
---|---|---|---|---|---|
Natural Rank | 2 | 4 | 6 | 8 | 10 |
When the sketch deletes quantiles it adjusts the associated ranks by effectively increasing the “weight” of adjacent quantiles so that they are approximately positionally correct and the top natural rank corresponds to n.
How do we resolve q(3) or r(20)?
The quantile sketch algorithms discussed in the literature primarily differ by how they choose which quantiles in the stream should be discarded. After the elimination process, all of the quantiles sketch implementations are left with the challenge of how to reconstruct the actual distribution, approximately and with good accuracy.
Given the presence of duplicates and absence of values from the stream we must redefine the above quantile and rank functions as inequalities while retaining the properties of 1:1 functions.
One can find examples of the following definitions in the research literature. All of our library quantile sketches allow the user to choose the searching criteria.
These next examples use a small data set that mimics what could be the result of both duplication and sketch data deletion.
Rule 1: Every quantile that exists in the input stream or retained by the sketch has an associated rank.
Rule 2: All of our quantile sketches only retain quantiles that exist in the actual input stream of quantiles.
Rule 3: For the getQuantile(rank) queries, all of our quantile sketches only return quantiles that were retained by the sketch. (i.e, we do not interpolate between quantiles.)
Rule 4: For the getRank(quantile) queries, all of our quantile sketches only return ranks that are associated with quantiles retained by the sketch. (i.e, we do not interpolate between ranks.)
Rule 5: All of our quantile algorithms compensate for quantiles removed during the sketch quantile selection and compression process by increasing the weights of some of the quantiles not selected for removal, such that:
Implementation:
Boundary Exceptions:
Examples using normalized ranks:
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
Quantile input | 30 | |||||||
Qualifying pair | q1 | q2 | ||||||
Rank result | .786 |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 | ? |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Quantile input | 55 | ||||||||
Qualifying pair | q1 | (q2) | |||||||
Rank result | 1.0 |
Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Quantile input | 5 | ||||||||
Qualifying pair | (q1) | q2 | |||||||
Rank result | 0 |
Implementation:
Boundary Exceptions:
Examples using normalized ranks:
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
Quantile input | 30 | |||||||
Qualifying pair | q1 | q2 | ||||||
Rank result | .357 |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 | ? |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Quantile input | 55 | ||||||||
Qualifying pair | q1 | (q2) | |||||||
Rank result | 1.000 |
Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Quantile input | 5 | ||||||||
Qualifying pair | (q1) | q2 | |||||||
Rank result | 0 |
Implementation:
Boundary Exceptions:
Examples using normalized ranks:
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
Rank input | .786 | |||||||
Qualifying pair | r1 | r2 | ||||||
Quantile result | 30 |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
Rank input | 1.0 | |||||||
Qualifying pair | r1 | r2 | ||||||
Quantile result | 50 |
Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Rank input | 0.0 | ||||||||
Qualifying pair | (r1) | r2 | |||||||
Rank result | 10 |
Implementation:
Boundary Exceptions:
Examples using normalized ranks:
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
Rank input | .357 | |||||||
Qualifying pair | r1 | r2 | ||||||
Quantile result | 30 |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 | ? |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Rank input | 1.0 | ||||||||
Qualifying pair | r1 | (r2) | |||||||
Quantile result | 50 |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
Rank input | .99 | |||||||
Qualifying pair | r1 | r2 | ||||||
Quantile result | 50 |
Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Rank input | 0.0 | ||||||||
Qualifying pair | (r1) | r2 | |||||||
Rank result | 10 |
Implementation:
Boundary Exceptions:
Examples using normalized ranks:
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
Rank input | .357 | |||||||
Qualifying pair | r1 | r2 | ||||||
Quantile result | 30 |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 | ? |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Rank input | 1.0 | ||||||||
Qualifying pair | r1 | (r2) | |||||||
Quantile result | NaN or null |
Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
Rank input | .99 | |||||||
Qualifying pair | r1 | r2 | ||||||
Quantile result | 50 |
Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
---|---|---|---|---|---|---|---|---|---|
Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 | |
Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 | |
Rank input | 0.0 | ||||||||
Qualifying pair | (r1) | r2 | |||||||
Rank result | 10 |
The power of these inequality search algorithms is that they produce repeatable and accurate results, are insensitive to duplicates and sketch deletions, and maintain the property of 1:1 functions.