Lens AI Profiler Cpp
Classes | Public Types | Public Member Functions | Static Public Member Functions | Static Public Attributes | List of all members
datasketches::frequent_items_sketch< T, W, H, E, A > Class Template Reference

#include <frequent_items_sketch.hpp>

Classes

class  items_deleter
 
class  row
 Row in the output from get_frequent_items. More...
 

Public Types

using vector_row = typename std::vector< row, typename std::allocator_traits< A >::template rebind_alloc< row > >
 
using vector_bytes = std::vector< uint8_t, typename std::allocator_traits< A >::template rebind_alloc< uint8_t > >
 

Public Member Functions

 frequent_items_sketch (uint8_t lg_max_map_size, uint8_t lg_start_map_size=LG_MIN_MAP_SIZE, const E &equal=E(), const A &allocator=A())
 
void update (const T &item, W weight=1)
 
void update (T &&item, W weight=1)
 
void merge (const frequent_items_sketch &other)
 
void merge (frequent_items_sketch &&other)
 
bool is_empty () const
 
uint32_t get_num_active_items () const
 
get_total_weight () const
 
get_estimate (const T &item) const
 
get_lower_bound (const T &item) const
 
get_upper_bound (const T &item) const
 
get_maximum_error () const
 
double get_epsilon () const
 
vector_row get_frequent_items (frequent_items_error_type err_type) const
 
vector_row get_frequent_items (frequent_items_error_type err_type, W threshold) const
 
template<typename SerDe = serde<T>>
size_t get_serialized_size_bytes (const SerDe &sd=SerDe()) const
 
template<typename SerDe = serde<T>>
void serialize (std::ostream &os, const SerDe &sd=SerDe()) const
 
template<typename SerDe = serde<T>>
vector_bytes serialize (unsigned header_size_bytes=0, const SerDe &sd=SerDe()) const
 
string< A > to_string (bool print_items=false) const
 
template<typename SerDe >
auto serialize (unsigned header_size_bytes, const SerDe &sd) const -> vector_bytes
 
template<typename SerDe >
frequent_items_sketch< T, W, H, E, A > deserialize (std::istream &is, const SerDe &sd, const E &equal, const A &allocator)
 
template<typename SerDe >
frequent_items_sketch< T, W, H, E, A > deserialize (const void *bytes, size_t size, const SerDe &sd, const E &equal, const A &allocator)
 

Static Public Member Functions

static double get_epsilon (uint8_t lg_max_map_size)
 
static double get_apriori_error (uint8_t lg_max_map_size, W estimated_total_weight)
 
template<typename SerDe = serde<T>>
static frequent_items_sketch deserialize (std::istream &is, const SerDe &sd=SerDe(), const E &equal=E(), const A &allocator=A())
 
template<typename SerDe = serde<T>>
static frequent_items_sketch deserialize (const void *bytes, size_t size, const SerDe &sd=SerDe(), const E &equal=E(), const A &allocator=A())
 

Static Public Attributes

static const uint8_t LG_MIN_MAP_SIZE = 3
 

Detailed Description

template<typename T, typename W = uint64_t, typename H = std::hash<T>, typename E = std::equal_to<T>, typename A = std::allocator<T>>
class datasketches::frequent_items_sketch< T, W, H, E, A >

Frequent Items sketch.

Based on Java implementation here: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/frequencies/ItemsSketch.java

Author
Alexander Saydakov

Constructor & Destructor Documentation

◆ frequent_items_sketch()

template<typename T , typename W , typename H , typename E , typename A >
datasketches::frequent_items_sketch< T, W, H, E, A >::frequent_items_sketch ( uint8_t  lg_max_map_size,
uint8_t  lg_start_map_size = LG_MIN_MAP_SIZE,
const E &  equal = E(),
const A &  allocator = A() 
)
explicit

Construct this sketch with parameters lg_max_map_size and lg_start_map_size.

Parameters
lg_max_map_sizeLog2 of the physical size of the internal hash map managed by this sketch. The maximum capacity of this internal hash map is 0.75 times 2^lg_max_map_size. Both the ultimate accuracy and size of this sketch are functions of lg_max_map_size.
lg_start_map_sizeLog2 of the starting physical size of the internal hash map managed by this sketch.
equalinstance of Equality operator
allocatorinstance of an Allocator

Member Function Documentation

◆ deserialize() [1/2]

template<typename T , typename W = uint64_t, typename H = std::hash<T>, typename E = std::equal_to<T>, typename A = std::allocator<T>>
template<typename SerDe = serde<T>>
static frequent_items_sketch datasketches::frequent_items_sketch< T, W, H, E, A >::deserialize ( const void *  bytes,
size_t  size,
const SerDe &  sd = SerDe(),
const E &  equal = E(),
const A &  allocator = A() 
)
static

This method deserializes a sketch from a given array of bytes.

Parameters
bytespointer to the array of bytes
sizethe size of the array
sdinstance of a SerDe
equalinstance of Equality operator
allocatorinstance of an Allocator
Returns
an instance of the sketch

◆ deserialize() [2/2]

template<typename T , typename W = uint64_t, typename H = std::hash<T>, typename E = std::equal_to<T>, typename A = std::allocator<T>>
template<typename SerDe = serde<T>>
static frequent_items_sketch datasketches::frequent_items_sketch< T, W, H, E, A >::deserialize ( std::istream &  is,
const SerDe &  sd = SerDe(),
const E &  equal = E(),
const A &  allocator = A() 
)
static

This method deserializes a sketch from a given stream.

Parameters
isinput stream
sdinstance of a SerDe
equalinstance of Equality operator
allocatorinstance of an Allocator
Returns
an instance of the sketch

◆ get_apriori_error()

template<typename T , typename W , typename H , typename E , typename A >
double datasketches::frequent_items_sketch< T, W, H, E, A >::get_apriori_error ( uint8_t  lg_max_map_size,
estimated_total_weight 
)
static

Returns the estimated a priori error given the max_map_size for the sketch and the estimated_total_stream_weight.

Parameters
lg_max_map_sizethe planned map size to be used when constructing this sketch.
estimated_total_weightthe estimated total stream weight.
Returns
the estimated a priori error.

◆ get_epsilon() [1/2]

template<typename T , typename W , typename H , typename E , typename A >
double datasketches::frequent_items_sketch< T, W, H, E, A >::get_epsilon

Returns epsilon value of this sketch. This is just the value 3.5 / max_map_size.

Returns
epsilon used by the sketch to compute error.

◆ get_epsilon() [2/2]

template<typename T , typename W , typename H , typename E , typename A >
double datasketches::frequent_items_sketch< T, W, H, E, A >::get_epsilon ( uint8_t  lg_max_map_size)
static

Returns epsilon used to compute a priori error. This is just the value 3.5 / maxMapSize.

Parameters
lg_max_map_sizethe planned map size to be used when constructing this sketch.
Returns
epsilon used to compute a priori error.

◆ get_estimate()

template<typename T , typename W , typename H , typename E , typename A >
W datasketches::frequent_items_sketch< T, W, H, E, A >::get_estimate ( const T &  item) const

Returns the estimate of the weight (frequency) of the given item. Note: The true frequency of a item would be the sum of the counts as a result of the two update functions.

Parameters
itemthe given item
Returns
the estimate of the weight (frequency) of the given item

◆ get_frequent_items() [1/2]

template<typename T , typename W , typename H , typename E , typename A >
auto datasketches::frequent_items_sketch< T, W, H, E, A >::get_frequent_items ( frequent_items_error_type  err_type) const

Returns an array of rows that include frequent items, estimates, upper and lower bounds given an error_type and using get_maximum_error() as a threshold.

The method first examines all active items in the sketch (items that have a counter).

If error_type = NO_FALSE_NEGATIVES, this will include an item in the result list if get_upper_bound(item) > threshold. There will be no false negatives, i.e., no Type II error. There may be items in the set with true frequencies less than the threshold (false positives).

If error_type = NO_FALSE_POSITIVES, this will include an item in the result list if get_lower_bound(item) > threshold. There will be no false positives, i.e., no Type I error. There may be items omitted from the set with true frequencies greater than the threshold (false negatives).

Parameters
err_typedetermines whether no false positives or no false negatives are desired.
Returns
an array of frequent items

◆ get_frequent_items() [2/2]

template<typename T , typename W , typename H , typename E , typename A >
auto datasketches::frequent_items_sketch< T, W, H, E, A >::get_frequent_items ( frequent_items_error_type  err_type,
threshold 
) const

Returns an array of rows that include frequent items, estimates, upper and lower bounds given an error_type and a threshold.

The method first examines all active items in the sketch (items that have a counter).

If error_type = NO_FALSE_NEGATIVES, this will include an item in the result list if get_upper_bound(item) > threshold. There will be no false negatives, i.e., no Type II error. There may be items in the set with true frequencies less than the threshold (false positives).

If error_type = NO_FALSE_POSITIVES, this will include an item in the result list if get_lower_bound(item) > threshold. There will be no false positives, i.e., no Type I error. There may be items omitted from the set with true frequencies greater than the threshold (false negatives).

Parameters
err_typedetermines whether no false positives or no false negatives are desired.
thresholdto include items in the result list
Returns
an array of frequent items

◆ get_lower_bound()

template<typename T , typename W , typename H , typename E , typename A >
W datasketches::frequent_items_sketch< T, W, H, E, A >::get_lower_bound ( const T &  item) const

Returns the guaranteed lower bound weight (frequency) of the given item.

Parameters
itemthe given item.
Returns
the guaranteed lower bound weight of the given item. That is, a number which is guaranteed to be no larger than the real weight.

◆ get_maximum_error()

template<typename T , typename W , typename H , typename E , typename A >
W datasketches::frequent_items_sketch< T, W, H, E, A >::get_maximum_error
Returns
An upper bound on the maximum error of get_estimate(item) for any item. This is equivalent to the maximum distance between the upper bound and the lower bound for any item.

◆ get_num_active_items()

template<typename T , typename W , typename H , typename E , typename A >
uint32_t datasketches::frequent_items_sketch< T, W, H, E, A >::get_num_active_items
Returns
the number of active items in the sketch

◆ get_serialized_size_bytes()

template<typename T , typename W , typename H , typename E , typename A >
template<typename SerDe >
size_t datasketches::frequent_items_sketch< T, W, H, E, A >::get_serialized_size_bytes ( const SerDe &  sd = SerDe()) const

Computes size needed to serialize the current state of the sketch. This can be expensive since every item needs to be looked at.

Parameters
sdinstance of a SerDe
Returns
size in bytes needed to serialize this sketch

◆ get_total_weight()

template<typename T , typename W , typename H , typename E , typename A >
W datasketches::frequent_items_sketch< T, W, H, E, A >::get_total_weight

Returns the sum of the weights (frequencies) in the stream seen so far by the sketch

Returns
the total weight of all items in the stream seen so far by the sketch

◆ get_upper_bound()

template<typename T , typename W , typename H , typename E , typename A >
W datasketches::frequent_items_sketch< T, W, H, E, A >::get_upper_bound ( const T &  item) const

Returns the guaranteed upper bound weight (frequency) of the given item.

Parameters
itemthe given item
Returns
the guaranteed upper bound weight of the given item. That is, a number which is guaranteed to be no smaller than the real frequency.

◆ is_empty()

template<typename T , typename W , typename H , typename E , typename A >
bool datasketches::frequent_items_sketch< T, W, H, E, A >::is_empty
Returns
true if this sketch is empty

◆ merge() [1/2]

template<typename T , typename W , typename H , typename E , typename A >
void datasketches::frequent_items_sketch< T, W, H, E, A >::merge ( const frequent_items_sketch< T, W, H, E, A > &  other)

This function merges the other sketch into this one. The other sketch may be of a different size.

Parameters
othersketch to be merged into this (lvalue)

◆ merge() [2/2]

template<typename T , typename W , typename H , typename E , typename A >
void datasketches::frequent_items_sketch< T, W, H, E, A >::merge ( frequent_items_sketch< T, W, H, E, A > &&  other)

This function merges the other sketch into this one. The other sketch may be of a different size.

Parameters
othersketch to be merged into this (rvalue)

◆ serialize() [1/2]

template<typename T , typename W , typename H , typename E , typename A >
template<typename SerDe >
void datasketches::frequent_items_sketch< T, W, H, E, A >::serialize ( std::ostream &  os,
const SerDe &  sd = SerDe() 
) const

This method serializes the sketch into a given stream in a binary form

Parameters
osoutput stream
sdinstance of a SerDe

◆ serialize() [2/2]

template<typename T , typename W = uint64_t, typename H = std::hash<T>, typename E = std::equal_to<T>, typename A = std::allocator<T>>
template<typename SerDe = serde<T>>
vector_bytes datasketches::frequent_items_sketch< T, W, H, E, A >::serialize ( unsigned  header_size_bytes = 0,
const SerDe &  sd = SerDe() 
) const

This method serializes the sketch as a vector of bytes. An optional header can be reserved in front of the sketch. It is a blank space of a given size. This header is used in Datasketches PostgreSQL extension.

Parameters
header_size_bytesspace to reserve in front of the sketch
sdinstance of a SerDe
Returns
serialized sketch as a vector of bytes

◆ to_string()

template<typename T , typename W , typename H , typename E , typename A >
string< A > datasketches::frequent_items_sketch< T, W, H, E, A >::to_string ( bool  print_items = false) const

Returns a human readable summary of this sketch

Parameters
print_itemsif true include the list of items retained by the sketch

◆ update() [1/2]

template<typename T , typename W , typename H , typename E , typename A >
void datasketches::frequent_items_sketch< T, W, H, E, A >::update ( const T &  item,
weight = 1 
)

Update this sketch with an item and a positive weight (frequency count).

Parameters
itemfor which the weight should be increased (lvalue)
weightthe amount by which the weight of the item should be increased A count of zero is a no-op, and a negative count will throw an exception.

◆ update() [2/2]

template<typename T , typename W , typename H , typename E , typename A >
void datasketches::frequent_items_sketch< T, W, H, E, A >::update ( T &&  item,
weight = 1 
)

Update this sketch with an item and a positive weight (frequency count).

Parameters
itemfor which the weight should be increased (rvalue)
weightthe amount by which the weight of the item should be increased A count of zero is a no-op, and a negative count will throw an exception.

The documentation for this class was generated from the following files: