DataSketches |

Size and Byte Array Structures

All the sketches in the theta package share a common byte array data structure and similar strategies for managing use of memory when the sketches are instantiated on the Java heap or when instantiated in off-heap native memory.

The byte array structure has two forms Hash Table and Compact.

Update Sketch Families – Hash Table Form

The Hash Table form is similar to how the sketch is instantiated on the Java heap. Hash tables consume more space depending on how full the table is. However, updating the sketch is much faster in this form and is the default for all the Update Sketches.

There are two Update Sketch families, QuickSelect, and Alpha. The QuickSelect family has both a Java heap and a direct, off-heap variants while the Alpha sketch is designed only for operation on the Java heap.

The Update Sketches consist of an internal hash table cache and a few other primitives. How the cache is sized and/or resized can be controlled by the user specified ResizeFactor enum that has the values X1, X2, X4, and X8 (default). The meaning of these values is as follows:

X1: A Resize Factor of 1 means the cache will not increase its size. A new sketch will immediately allocate space for twice the given Nominal Entries. For example, if nomEntries is 4096 the internal cache will be allocated at 8192 entries. Each entry is 8 bytes.
X2: A Resize Factor of 2 means the cache may increase the cache size by factors of 2. A new sketch will initially allocate a very small cache (typically enough for 16 entries), and, when required, will increase the cache size by factors of 2 until the maximum cache size is reached, which is twice the Nominal Entries.
X4: A Resize Factor of 4 is similar to X2, except the resizing factor is 4.
X8: A Resize Factor of 8 is similar to X2, except the resizing factor is 8. This is the default.

Resizing a cache is costly as a new array needs to be allocated and the hash table rebuilt. Depending on the application, the user can select the Resize Factor to trade-off memory size and overall update speed performance.

When an Update Sketch is converted to a byte array using toByteArray(), the internal structure is a 24-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the sketch to be quickly reconstructed from the byte array so that updating of the sketch can be continued.

Compact Sketch Family – Compact Form

Once the updating of a sketch is completed the HT is no longer needed, so the sketch can be stored in a compact form. The size of this compact form is a simple function of the number of retained hash values (8 bytes) and a small preamble that varies from 8 to 24 bytes depending on the internal state of the sketch. An empty sketch is represented by only 8 bytes. The upper limit of the size of the sketch varies by the type of sketch but is in the range of 8k to 16k. k is the configured size of the sketch in nominal entries, and also determines the accuracy of the sketch.

Compact Sketches optimize memory usage and sketch merge performance, are inherently read only and can only be created from an existing Update Sketch or Set Operation (Union, Intersection, or AnotB) object. The internal structure of Compact Sketches is an 8, 16, or 24 byte preamble followed by a contiguous data array of 8-byte, long data entries. This data array can be either ordered or unordered. This data structure is the same whether the Compact Sketch is instantiated on the Java heap or in off-heap direct memory. An empty Compact Sketch consumes only 8 bytes.

The ordered form is ideal for systems environments where the building of the sketches from data occur in an offline system (like a Hadoop grid), then compacted, ordered and uploaded to a real-time query engine (like Druid) where the compact sketches can be quickly merged to satisfy end-user queries. The ordered form enables early stop enhancements to the merge algorithm that makes the merging extreemly fast (~ 10’s of millions of sketches per second).

The unordered form is more desirable for systems environments where the building of the sketches from data occur in real-time and queried in real-time. In this environment there is no need to pay the cost of the sort.

Thus, the choice of ordered or unordered is a tradeoff between real-time sketch build & getEstimate() performance and offline sketch-build and real-time merge performance.

The Compact Sketch is created four different ways selected by the ordering preference (dstOrdered) and whether the Compact Sketch will reside on-heap or off-heap (dstMemory).

Unordered, On-heap
- Update Sketch: compact(false, null)
- Set Operation: getResult(false, null)
Ordered, On-heap
- Update Sketch: compact(true, null)
- Set Operation: getResult(true, null)
Unordered, Off-heap
- Update Sketch: compact(false, Memory mem)
- Set Operation: getResult(false, Memory mem)
Ordered, Off-heap
- Update Sketch: compact(true, Memory mem)
- Set Operation: getResult(true, Memory mem)

Compact Sketch Sizing

These tables compute the size of a sketch after it has been converted into Compact Form. The size of a sketch during the build phase is explained above as the sketch starts small and resizes by the configurable Resize Factor up to the in-memory size of 2k*8 bytes plus a few primitives.

Note: a sketch entry = 8 bytes.

Quick Select Sketch (Default)

The number of valid entries in the Quick Select Sketch after it enters estimation mode statistically varies from k to 15k/8 with an average of about 3k/2. It is a user option to force a rebuild() prior to compacting the sketch in which case the number of valid entries is never larger than k.

	Empty	After Rebuild()	Estimating Avg	Estimating Max
Nominal Entries (k) : Formula ->	8	*k8 +24**	*k12 + 24**	*k15 + 24**
16	8	152	216	264
32	8	280	408	504
64	8	536	792	984
128	8	1,048	1,560	1,944
256	8	2,072	3,096	3,864
512	8	4,120	6,168	7,704
1,024	8	8,216	12,312	15,384
2,048	8	16,408	24,600	30,744
4,096	8	32,792	49,176	61,464
8,192	8	65,560	98,328	122,904
16,384	8	131,096	196,632	245,784
32,768	8	262,168	393,240	491,544
65,536	8	524,312	786,456	983,064
131,072	8	1,048,600	1,572,888	1,966,104
262,144	8	2,097,176	3,145,752	3,932,184
524,288	8	4,194,328	6,291,480	7,864,344
1,048,576	8	8,388,632	12,582,936	15,728,664

Alpha Sketch

The number of valid entries in the Alpha Sketch after it enters estimation mode is a random variable that has a probability distribution with a mean of k and a standard deviation of sqrt(k). The last column computes the maximum size with a confidence of 99.997% representing plus 4 standard deviations.

	Empty	Estimating Avg	Std Dev	Max @ 99.997%
Nominal Entries (k) : Formula ->	8	*k8 + 24**	sqrt(k)	*(k+4SD)8 +24**
512	8	4,120	23	4,844
1,024	9	8,216	32	9,240
2,048	10	16,408	45	17,856
4,096	11	32,792	64	34,840
8,192	12	65,560	91	68,456
16,384	13	131,096	128	135,192
32,768	14	262,168	181	267,961
65,536	15	524,312	256	532,504
131,072	16	1,048,600	362	1,060,185
262,144	17	2,097,176	512	2,113,560
524,288	18	4,194,328	724	4,217,498
1,048,576	19	8,388,632	1,024	8,421,400

Set Operation Family

Union

The Union operation has both a Java heap and a direct, off-heap variant. When a Union operation is converted to a byte array using toByteArray(), the internal structure is a 32-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the Union to be quickly reconstructed from the byte array so that updating of the Union can be continued.

Intersection

The Intersection operation has both a Java heap and a direct, off-heap variant. When an Intersection operation is converted to a byte array using toByteArray(), the internal structure is a 24-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the Intersection to be quickly reconstructed from the byte array so that updating of the Intersection can be continued.

A not B

The A not B operation is asymmetric and stateless. Both the A and B arguments are presented and the difference is computed and returned. There is no need for a byte array form.