All set operations (Union, Intersection, Difference) between two sketches A and B must obey the following two rules:
These rules can be extended to arbitrary set expressions as:
As long as all source sketches and target have been configured with the same Nominial Entries or k, and there has been no other intervening Intersection or AnotB operations, the accuracy of the resulting Union sketch will have the same Relative Standard Error (RSE) as determined by the Nominal Entries or k value. In other words,
This remains true no matter how many sketches are unioned together.
The Relative Standard Error (RSE) of the resulting Union sketch will determined by the Theta Rules above. Ultimately, the source sketch with the smallest theta will dominate the overall resulting error of the result. Given two sketches with equal cardinalities and different values of k, the sketch with the smaller value of k will have the smallest value of theta and will largely determine the error distribution of the result.
Conceptually, the Intersection and AnotB functions operate by first performing a Union of all the values from both source sketches and then identifying the appropriate proper subset of that Union set creating a result sketch with the Union theta (which is the minimum theta of the source sketches) and the qualifying subset of values.
This means, of course, that depending on the operations and the data, the result set could have zero, all, or some number in between of the retained values of the Union sketch. Mixed set expressions can produce an error distribution that is larger that of a standard sketch of a given Nominal Entries or k and is mathematically described in Sketch Equations / Subsets of Fixed k Sampling.
When the source and target sketches have the same value of k, the accuracy of the result sketch has a relatively straightforward mathematical solution and intuition:
The intuition for this can be explained using this simple example of intersection (set difference would behave similarly):
Suppose we have as inputs two segments of unique identifiers, A and B.
The hash values retained in SA represent a uniform sampling of all of the unique identifiers in segment A and the resulting value of θA represents the effective sampling rate required to end up with k samples, which is ~ 4K/4M = .001.
Even though in the raw data all the values of segment B are in segment A, the probability that all the 4K samples of SB appear SA is extremely low since only one in one thousand can be randomly chosen.
Applying the Theta Rules:
The mean estimate from the intersection sketch will be 4/.001 = 4K. This happens to be correct using this hand-wavy analysis but in general is a random result with a variance. The proof that the estimate will be unbiased is in the attached Sketch Equations.
The RSE of a sketch with only 4 values is ~ 1/sqrt(4) = .5 or 50% error. This is considerably larger than the RSE of either SA or SB, which is ~ 1.6% as stated above.
The insight to be gained from this example is that it was the theta (sampling rate) of the sketch of the larger population that caused the increase in error. And, for this example, increasing the sketch size of SA would improve the resulting error. The general case may be more complex.
More formally, if we define a factor F to be the ratio:
Then a simple way to compute the resulting RSE of an intersection (or difference) is
And in our example:
In the general case this scenario is even more complex and difficult to predict mathematically, but a conservative estimate would be to use the above equations substituting k with the smallest of the participating k values.