register datasketches-memory-2.0.0.jar;
register datasketches-java-3.1.0.jar;
register datasketches-pig-1.1.0.jar;
-- very small sketch just for the purpose of this tiny example
define dataToSketch org.apache.datasketches.pig.frequencies.DataToFrequentStringsSketch('8');
define unionSketch org.apache.datasketches.pig.frequencies.UnionFrequentStringsSketch('8');
define getEstimates org.apache.datasketches.pig.frequencies.FrequentStringsSketchToEstimates();
a = load 'data.txt' as (item:chararray, category);
b = group a by category;
c = foreach b generate flatten(group) as (category), flatten(dataToSketch(a.item)) as (sketch);
-- Sketches can be stored at this point in binary format to be used later:
-- store c into 'intermediate/$date' using BinStorage();
-- The next two lines print the results in human readable form for the purpose of this example
d = foreach c generate category, getEstimates(sketch);
dump d;
-- The rest of the script can be run as a separate query.
-- For example, the first part can produce a daily intermediate feed and store it.
-- The second part can then load several instances of that feed and merge them:
-- c = load 'intermediate/$date1,intermediate/$date2' using BinStorage() as (category, sketch);
e = group c all;
f = foreach e generate flatten(unionSketch(c.sketch)) as (sketch);
g = foreach f generate getEstimates(sketch);
describe g;
dump g;
The example input data has 2 fields: item and category. In the first part of the query, the data is grouped by category and one FrequentItemsSketch<String> is built per category. In the second part, these intermediate sketches are merged across categories into a single sketch. This demonstrates all 3 UDFs: DataToFrequentStringsSketch, UnionFrequentStringsSketch and FrequentStringsSketchToEstimates.
Results:
From ‘dump d’ (one sketch per category):
(c1,{(a,7,7,7),(d,2,2,2),(b,1,1,1)})
(c2,{(a,5,5,5),(d,2,2,2),(e,1,1,1),(c,1,1,1)})
From ‘dump g’ (merged across categories):
({(a,12,12,12),(d,4,4,4),(b,1,1,1),(e,1,1,1),(c,1,1,1)})
From ‘describe g’:
g: {bag_of_item_tuples: {item_tuple: (item: chararray,estimate: long,lower_bound: long,upper_bound: long)}}
In this example the results are exact because the input is small: every estimate equals its lower and upper bound.
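The schema printed by 'describe g' suggests a simple way to get one row per item instead of a single bag. The following is a minimal sketch, assuming the relation g from the script above; the relation names per_item and sorted_items are made up for illustration.

```pig
-- Assumes g from the script above, with the schema shown by 'describe g'.
-- Unnest the bag so that each item becomes its own row.
per_item = foreach g generate flatten(bag_of_item_tuples)
    as (item:chararray, estimate:long, lower_bound:long, upper_bound:long);
-- Most frequent items first.
sorted_items = order per_item by estimate desc;
dump sorted_items;
```

Flattening first keeps the later steps (ordering, filtering on lower_bound, joining with other data) in plain Pig operators rather than inside a bag.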
Input file data.txt (item and category, separated by a tab in the actual file):
a c1
a c1
a c1
a c2
a c1
b c1
c c2
d c1
e c2
a c1
a c2
a c2
a c2
d c1
d c2
a c1
a c2
a c1
d c2