Designed for Large-scale Computing Systems

Multiple Languages

  • The DataSketches library is now available in three languages: Java, C++, and Python. A fourth language, Go, is in development.

Compatibility Across Languages, Software Versions And Binary Serialization Versions

Large-scale computing environments may mix various platforms that use different programming languages, each possibly running a different Software Version of the DataSketches library. Cross-version compatibility is a challenge that all platforms face, and it is up to the platform maintainers to keep their software up to date. This is not new, and it is no different with the DataSketches library.

Nonetheless, it is our goal to make it as easy as practically possible to serialize a sketch in one supported language on one platform and deserialize it in a different supported language, potentially on a different, even remote, platform, and perhaps much later in time.

With this goal in mind, here are some of the key strategic decisions we have made in the development of the DataSketches library.

Two levels of versioning.

  • Software Version: This is the release version, published via Apache.org and specified in the POM file or equivalent. This can change relatively frequently based on bug fixes and introduction of new capabilities. We follow the principles of Semantic Versioning as specified by semver.org.

  • Serialization Version (SerVer): This is a small integer placed in the preamble of the serialized byte array that indicates the version of the serialized structure for the sketch. This is very similar to Java’s Class File Format Version. A single SerVer may represent multiple structures, all based on the same sketch, when stored in different states (e.g., Single Item, Compact, Updatable, etc.). The SerVer changes very rarely, if at all. Of all of our sketches, only a few, e.g., Theta, KLL and Sampling, have had more than one SerVer over time. There are, and will be, many Software Versions of the same sketch that still use the same SerVer. When we have to update the SerVer, the Software Version that introduces the new SerVer can also read the old SerVer and convert it to the new one. This is why our newest Software Versions can still read and interpret older SerVer serialized sketches that go back to when our project was started at Yahoo (2012), before we went open-source (2015). Technically speaking, this can be described as backward transitive compatibility (see “Schema Evolution and Compatibility” and “Schema Evolution”).
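The read-and-convert behavior described for the Serialization Version can be sketched as a chain of small upgrade steps. This is a minimal illustration, not the library's code: the three layouts, the field names, and the `read_any_version` helper are all hypothetical, and only the idea of keying the decoder on the SerVer byte comes from the text.

```python
import struct

CURRENT_SER_VER = 3  # hypothetical: the newest layout this reader understands

# Hypothetical layouts, all little-endian:
#   byte 0 = preamble size, byte 1 = SerVer, byte 2 = sketch type
#   v1: bytes 4-7  = int32 count
#   v2: byte 3 = flags, bytes 4-7  = int32 count
#   v3: byte 3 = flags, bytes 8-15 = int64 count


def _parse(image):
    """Decode an image using the layout named by its own SerVer byte."""
    ser_ver = image[1]
    if ser_ver == 1:
        (count,) = struct.unpack_from("<i", image, 4)
        return ser_ver, {"count": count}
    if ser_ver == 2:
        (count,) = struct.unpack_from("<i", image, 4)
        return ser_ver, {"flags": image[3], "count": count}
    if ser_ver == 3:
        (count,) = struct.unpack_from("<q", image, 8)
        return ser_ver, {"flags": image[3], "count": count}
    raise ValueError(
        f"unsupported SerVer {ser_ver}; this reader handles 1..{CURRENT_SER_VER}"
    )


def _v1_to_v2(fields):
    # v1 had no flags byte, so an old image implies flags = 0.
    return {**fields, "flags": 0}


def _v2_to_v3(fields):
    # v3 only widened the count on the wire; the decoded value is unchanged.
    return dict(fields)


_UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}


def read_any_version(image):
    """Read any known SerVer and return fields in the newest representation."""
    ser_ver, fields = _parse(image)
    while ser_ver < CURRENT_SER_VER:
        fields = _UPGRADES[ser_ver](fields)
        ser_ver += 1
    return fields


v1_image = bytes([1, 1, 7, 0]) + struct.pack("<i", 42)
v3_image = bytes([2, 3, 7, 5]) + bytes(4) + struct.pack("<q", 42)
```

Because each step only converts from one version to the next, adding a new SerVer means adding one parser branch and one upgrade function, while images written under every earlier SerVer stay readable.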

From the user’s perspective, as long as the SerVer is the same, older Software Versions should be able to read sketch images created by newer Software Versions, although the APIs may differ. An older Software Version will not be able to take advantage of features introduced in newer Software Versions, but it should be able to do what it did before. In other words, there will be no loss of access to the serialized sketch or to the older Software Version’s capabilities. A user should not need to access the SerVer; nonetheless, it is always stored at byte index 1 of the serialized image. If a sketch is presented with a SerVer that it is not compatible with, the sketch should throw an exception that states the problem, just as Java does with its Class File Format Versions.
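As a rough illustration of that last point, a reader might inspect the SerVer byte and fail with a descriptive error before touching the rest of the image. The function name, the exception class, and the set of supported versions below are hypothetical; only the position of the SerVer at index one is taken from the text.

```python
SER_VER_INDEX = 1  # per the text, the SerVer sits at index one of the image


class UnsupportedSerVerError(ValueError):
    """Hypothetical error raised when a reader cannot interpret a SerVer."""


def check_ser_ver(image, supported):
    """Return the image's SerVer, or raise if this reader cannot handle it."""
    if len(image) <= SER_VER_INDEX:
        raise ValueError("image is too short to contain a SerVer byte")
    ser_ver = image[SER_VER_INDEX]
    if ser_ver not in supported:
        raise UnsupportedSerVerError(
            f"image has SerVer {ser_ver}, but this reader supports "
            f"{sorted(supported)}"
        )
    return ser_ver


# An older reader that knows SerVer 1 and 2 rejects a SerVer 3 image.
image = bytes([2, 3, 1, 0])
try:
    check_ser_ver(image, supported={1, 2})
    message = ""
except UnsupportedSerVerError as err:
    message = str(err)
```

The error text names both the version found and the versions the reader understands, which is usually enough for an operator to see that the reading side needs an upgrade.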

Sketches requiring user-written custom serialize/deserialize code rely on users to port that custom code themselves for cross-version, cross-language, and cross-platform compatibility.
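To make the porting burden concrete, here is a minimal sketch of what such custom code might look like for string items. The byte layout (a 4-byte unsigned little-endian length followed by UTF-8 bytes) and both function names are assumptions chosen for illustration, not a format prescribed by the library. Whatever layout is chosen, every language that reads the sketch has to reproduce it byte for byte.

```python
import struct


def serialize_items(items):
    """Encode strings as [uint32 little-endian length][UTF-8 bytes] records."""
    out = bytearray()
    for item in items:
        data = item.encode("utf-8")
        out += struct.pack("<I", len(data))
        out += data
    return bytes(out)


def deserialize_items(buf, num_items):
    """Decode num_items records written by serialize_items."""
    items = []
    offset = 0
    for _ in range(num_items):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return items


original = ["apple", "naïve", "日本語"]
round_tripped = deserialize_items(serialize_items(original), len(original))
```

Fixing the byte order and the string encoding explicitly, rather than relying on platform defaults, is what keeps such code portable across languages and platforms.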

The Serialized Image of a Sketch

  • The structure (or image) of a serialized sketch is independent of the language from which it was created.
  • The sketch image contains only little-endian primitives, such as int64, int32, int16, int8, double-64, float-32, and UTF-8 strings, along with simple array structures of those. While these serialized primitives may not be strictly equal between languages, they can be interpreted as logically equivalent. We do not support big-endian serialization.
  • The sketch image is unique for each type of sketch.
  • Simply put, a sketch image can be viewed as a blob of bytes, which is easily stored and easily transported using many different protocols and encodings, including Protobuf, Avro, Thrift, Base64, etc.
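The points in this list can be illustrated with Python's standard `struct` and `base64` modules. The two-field image below (an int64 followed by a 64-bit double) is a made-up layout, not the structure of any real sketch. It only shows little-endian packing and a text-safe round trip of the resulting blob.

```python
import base64
import struct


def build_image(count, value):
    """Pack a hypothetical image: '<' = little-endian, q = int64, d = double."""
    return struct.pack("<qd", count, value)


image = build_image(1, 1.5)

# Treat the image as an opaque blob and move it through a text-only channel.
text = base64.b64encode(image).decode("ascii")
restored = base64.b64decode(text)

count, value = struct.unpack("<qd", restored)
```

Here `image.hex()` is `0100000000000000000000000000f83f`: the least significant byte of each field comes first, which is what any other language reading the blob needs to assume.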

As a result, sketches serialized in one supported language can be interpreted by a different supported language, with the caveat that due to language differences, availability of resources, and time to develop, not all sketches may be available in all languages at the same time.

Easy Integration with Minimal Dependencies

We strive to make our sketch library easy to integrate into larger systems by keeping the number of external dependencies at a minimum.

  • Java Core
    • The Java core library (including Memory) has no runtime dependencies outside of the JVM, allowing simple integration into virtually any Java-based system environment.
    • All of the Java components and artifacts are Maven deployable and registered with The Central Repository.
  • C++ Core
    • The C++ core is header-only, allowing easy integration into a wide range of operating system environments.
  • Python
    • The C++ Core is extended using the Python binding library pybind11, enabling high-performance operation from Python.

Speed

  • These single-pass, “one-touch” algorithms are fast (see example), enabling real-time processing.

  • Sketches can be represented in an updatable or compact form. The compact form is smaller, immutable and faster to merge.

  • Some of the Java sketches have been designed to be instantiated and operated off-heap, which eliminates costly serialization and deserialization.

  • The sketch data structures are “additive” and embarrassingly parallelizable. Sketches can be merged without losing accuracy.
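The mergeability claim can be demonstrated with a toy "k minimum values" sketch written in plain Python. This is not the DataSketches implementation or API: the class, its hash function, and its estimator are simplified stand-ins, chosen only to show that merging two partial sketches gives exactly the state a single pass over all the data would have produced.

```python
import hashlib


class ToyKmvSketch:
    """Toy distinct-count sketch that retains the k smallest 64-bit hashes."""

    def __init__(self, k=64):
        self.k = k
        self.hashes = set()

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little")

    def _trim(self):
        # Keep only the k smallest hash values seen so far.
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])

    def update(self, item):
        self.hashes.add(self._hash(item))
        self._trim()

    def merge(self, other):
        """Return a new sketch that summarizes both inputs (same k assumed)."""
        merged = ToyKmvSketch(self.k)
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # exact while under capacity
        return (self.k - 1) * 2.0 ** 64 / max(self.hashes)


# Two workers each see half of the stream; a third sketch sees all of it.
part_a, part_b, single = ToyKmvSketch(), ToyKmvSketch(), ToyKmvSketch()
for i in range(10_000):
    (part_a if i % 2 == 0 else part_b).update(i)
    single.update(i)

merged = part_a.merge(part_b)
```

The merged sketch and the single-pass sketch hold identical hash sets, so their estimates are identical too. That is why partitions can be processed in parallel and combined afterwards.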

System Integrations

The following are system integrations that we have been involved with, but there are many more platform integrations out there that were performed by the individual platform teams.

Specific Sketch Features for Large Data

  • Hash Seed Handling. Additional protection for managing hash seeds, which is particularly important when processing sensitive user identifiers. Available with Theta Sketches.

  • Pre-Sampling. Built-in up-front sampling for cases where additional control is required to limit overall memory consumption when dealing with millions of sketches. Available with Theta Sketches.

  • Memory Component. Large query systems often require their own heaps outside the JVM in order to better manage garbage collection latencies. The Java sketches utilize this powerful component.

  • Built-in Upper-Bound and Lower-Bound estimators. You are never in the dark about how good an estimate the sketch is providing. Nearly all the sketches are able to estimate the upper and lower bounds of the estimate given a confidence level.

  • User configurable trade-offs of accuracy vs. storage space as well as other performance tuning options.

  • Small Footprint Per Sketch. The in-memory run-time and storage footprints, for both row- and column-oriented storage, are minimized with compact binary representations, which are much smaller than the raw input stream and have a well-defined upper bound on size.
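The trade-off between accuracy and storage, and the idea of bounds at a confidence level, can be illustrated with a back-of-envelope calculation. The formulas below are assumptions for illustration only: a relative standard error on the order of 1/sqrt(k), a common rule of thumb for sketches that retain k hash values, a symmetric normal-style interval, and roughly 8 bytes per retained 64-bit hash. The library computes its own bounds and sizes, which will differ in detail.

```python
import math


def rough_tradeoff(k, estimate, num_std_dev=2):
    """Return (rse, lower, upper, approx_bytes) under the stated assumptions."""
    rse = 1.0 / math.sqrt(k)              # assumed: RSE ~ 1/sqrt(k)
    lower = estimate * (1.0 - num_std_dev * rse)
    upper = estimate * (1.0 + num_std_dev * rse)
    approx_bytes = 8 * k                  # assumed: 8 bytes per retained hash
    return rse, lower, upper, approx_bytes


# Doubling the accuracy (halving the RSE) costs about four times the space.
for k in (1024, 4096, 16384):
    rse, lower, upper, size = rough_tradeoff(k, estimate=1_000_000)
    print(f"k={k:6d}  rse={rse:.4f}  bounds=[{lower:,.0f}, {upper:,.0f}]  ~{size:,} bytes")
```

Whatever the exact constants are, the storage cost depends on the configured k and not on the number of items in the input stream. This is what gives each sketch a predictable upper bound on size.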