API Snapshots: Java Core, C++ Core, Python, Memory, Pig, Hive,

Size and Byte Array Structures

All the sketches in the theta package share a common byte array data structure and similar strategies for managing use of memory when the sketches are instantiated on the Java heap or when instantiated in off-heap native memory.

The byte array structure has two forms Hash Table and Compact.

Update Sketch Families – Hash Table Form

The Hash Table form is similar to how the sketch is instantiated on the Java heap. Hash tables consume more space depending on how full the table is. However, updating the sketch is much faster in this form and is the default for all the Update Sketches.

There are two Update Sketch families, QuickSelect, and Alpha. The QuickSelect family has both a Java heap and a direct, off-heap variants while the Alpha sketch is designed only for operation on the Java heap.

The Update Sketches consist of an internal hash table cache and a few other primitives. How the cache is sized and/or resized can be controlled by the user specified ResizeFactor enum that has the values X1, X2, X4, and X8 (default). The meaning of these values is as follows:

  • X1: A Resize Factor of 1 means the cache will not increase its size. A new sketch will immediately allocate space for twice the given Nominal Entries. For example, if nomEntries is 4096 the internal cache will be allocated at 8192 entries. Each entry is 8 bytes.
  • X2: A Resize Factor of 2 means the cache may increase the cache size by factors of 2. A new sketch will initially allocate a very small cache (typically enough for 16 entries), and, when required, will increase the cache size by factors of 2 until the maximum cache size is reached, which is twice the Nominal Entries.
  • X4: A Resize Factor of 4 is similar to X2, except the resizing factor is 4.
  • X8: A Resize Factor of 8 is similar to X2, except the resizing factor is 8. This is the default.

Resizing a cache is costly as a new array needs to be allocated and the hash table rebuilt. Depending on the application, the user can select the Resize Factor to trade-off memory size and overall update speed performance.

When an Update Sketch is converted to a byte array using toByteArray(), the internal structure is a 24-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the sketch to be quickly reconstructed from the byte array so that updating of the sketch can be continued.

Compact Sketch Family – Compact Form

Once the updating of a sketch is completed the HT is no longer needed, so the sketch can be stored in a compact form. The size of this compact form is a simple function of the number of retained hash values (8 bytes) and a small preamble that varies from 8 to 24 bytes depending on the internal state of the sketch. An empty sketch is represented by only 8 bytes. The upper limit of the size of the sketch varies by the type of sketch but is in the range of 8k to 16k. k is the configured size of the sketch in nominal entries, and also determines the accuracy of the sketch.

Compact Sketches optimize memory usage and sketch merge performance, are inherently read only and can only be created from an existing Update Sketch or Set Operation (Union, Intersection, or AnotB) object. The internal structure of Compact Sketches is an 8, 16, or 24 byte preamble followed by a contiguous data array of 8-byte, long data entries. This data array can be either ordered or unordered. This data structure is the same whether the Compact Sketch is instantiated on the Java heap or in off-heap direct memory. An empty Compact Sketch consumes only 8 bytes.

The ordered form is ideal for systems environments where the building of the sketches from data occur in an offline system (like a Hadoop grid), then compacted, ordered and uploaded to a real-time query engine (like Druid) where the compact sketches can be quickly merged to satisfy end-user queries. The ordered form enables early stop enhancements to the merge algorithm that makes the merging extreemly fast (~ 10’s of millions of sketches per second).

The unordered form is more desirable for systems environments where the building of the sketches from data occur in real-time and queried in real-time. In this environment there is no need to pay the cost of the sort.

Thus, the choice of ordered or unordered is a tradeoff between real-time sketch build & getEstimate() performance and offline sketch-build and real-time merge performance.

The Compact Sketch is created four different ways selected by the ordering preference (dstOrdered) and whether the Compact Sketch will reside on-heap or off-heap (dstMemory).

  • Unordered, On-heap
    • Update Sketch: compact(false, null)
    • Set Operation: getResult(false, null)
  • Ordered, On-heap
    • Update Sketch: compact(true, null)
    • Set Operation: getResult(true, null)
  • Unordered, Off-heap
    • Update Sketch: compact(false, Memory mem)
    • Set Operation: getResult(false, Memory mem)
  • Ordered, Off-heap
    • Update Sketch: compact(true, Memory mem)
    • Set Operation: getResult(true, Memory mem)

Compact Sketch Sizing

These tables compute the size of a sketch after it has been converted into Compact Form. The size of a sketch during the build phase is explained above as the sketch starts small and resizes by the configurable Resize Factor up to the in-memory size of 2k*8 bytes plus a few primitives.

Note: a sketch entry = 8 bytes.

Quick Select Sketch (Default)

The number of valid entries in the Quick Select Sketch after it enters estimation mode statistically varies from k to 15k/8 with an average of about 3k/2. It is a user option to force a rebuild() prior to compacting the sketch in which case the number of valid entries is never larger than k.

  Empty After Rebuild() Estimating Avg Estimating Max
Nominal Entries (k) : Formula -> 8 k*8 +24 k*12 + 24 k*15 + 24
16 8 152 216 264
32 8 280 408 504
64 8 536 792 984
128 8 1,048 1,560 1,944
256 8 2,072 3,096 3,864
512 8 4,120 6,168 7,704
1,024 8 8,216 12,312 15,384
2,048 8 16,408 24,600 30,744
4,096 8 32,792 49,176 61,464
8,192 8 65,560 98,328 122,904
16,384 8 131,096 196,632 245,784
32,768 8 262,168 393,240 491,544
65,536 8 524,312 786,456 983,064
131,072 8 1,048,600 1,572,888 1,966,104
262,144 8 2,097,176 3,145,752 3,932,184
524,288 8 4,194,328 6,291,480 7,864,344
1,048,576 8 8,388,632 12,582,936 15,728,664
Alpha Sketch

The number of valid entries in the Alpha Sketch after it enters estimation mode is a random variable that has a probability distribution with a mean of k and a standard deviation of sqrt(k). The last column computes the maximum size with a confidence of 99.997% representing plus 4 standard deviations.

  Empty Estimating Avg Std Dev Max @ 99.997%
Nominal Entries (k) : Formula -> 8 k*8 + 24 sqrt(k) (k+4SD)*8 +24
512 8 4,120 23 4,844
1,024 9 8,216 32 9,240
2,048 10 16,408 45 17,856
4,096 11 32,792 64 34,840
8,192 12 65,560 91 68,456
16,384 13 131,096 128 135,192
32,768 14 262,168 181 267,961
65,536 15 524,312 256 532,504
131,072 16 1,048,600 362 1,060,185
262,144 17 2,097,176 512 2,113,560
524,288 18 4,194,328 724 4,217,498
1,048,576 19 8,388,632 1,024 8,421,400

Set Operation Family

Union

The Union operation has both a Java heap and a direct, off-heap variant. When a Union operation is converted to a byte array using toByteArray(), the internal structure is a 32-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the Union to be quickly reconstructed from the byte array so that updating of the Union can be continued.

Intersection

The Intersection operation has both a Java heap and a direct, off-heap variant. When an Intersection operation is converted to a byte array using toByteArray(), the internal structure is a 24-byte preamble followed by a non-contiguous hash table array of 8-byte, long data entries. This enables the Intersection to be quickly reconstructed from the byte array so that updating of the Intersection can be continued.

A not B

The A not B operation is asymmetric and stateless. Both the A and B arguments are presented and the difference is computed and returned. There is no need for a byte array form.