WO2018069928A1 - MTS sketch for accurate estimation of set-expression cardinalities from small samples

Info

Publication number
WO2018069928A1
Authority
WO
WIPO (PCT)
Prior art keywords
cardinality
sample
data structure
stream
value
Prior art date
Application number
PCT/IL2017/051134
Other languages
French (fr)
Inventor
Reuven Cohen
Liran Katzir
Aviv YEHEZKEL
Original Assignee
Technion Research & Development Foundation Limited
Application filed by Technion Research & Development Foundation Limited
Publication of WO2018069928A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled

Definitions

  • the present invention in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
  • One or more of such data processing methodologies may include identifying the cardinality, i.e. the number of distinct elements, of streams and/or sets comprising a plurality of elements with repetitions. The cardinality may be of major interest for multiple applications ranging from database queries to network traffic monitoring and network security applications.
  • a computer implemented method of estimating a cardinality of a stream comprising using one or more processors configured to execute a code, the code is adapted for:
  • the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions.
  • the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
  • the MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams using only a significantly small data portion of the stream(s).
  • the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.
  • the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time while maintaining high accuracy of the estimated cardinality.
  • a system for estimating a cardinality of a stream comprising one or more processors adapted to execute code, the code comprising:
  • code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream.
  • Code instructions to compute a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions.
  • the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
  • a computer implemented method of estimating a cardinality of set expressions between streams comprising using one or more processors configured to execute a code, the code is adapted for:
  • the first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions.
  • the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
  • an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.
  • Computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value. Since the MTS sketch is additive in nature, the MTS algorithms used for estimating the cardinality of a single stream may be easily and efficiently extended for estimating the cardinality of set expressions of the streams, in particular, a set union, a set intersection and a set difference. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
  • a system for estimating a cardinality of set expressions between streams comprising one or more processors adapted to execute code, the code comprising:
  • Code instructions to compute a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions.
  • the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
  • each of the plurality of elements includes one or more members of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet.
  • the MTS sketch based cardinality estimation may be applied to estimate the cardinality of a diverse range of streams used by multiple applications which may be of very different nature.
  • the type of the elements of the stream(s) may vary while the same concepts of the MTS sketch based cardinality estimation may apply.
  • the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples.
  • the MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set union which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
  • the combination function is an intersection function to create a set intersection between the plurality of streams
  • the sample cardinality is created for a set intersection between the second data structure of the plurality of samples
  • the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
  • the MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set intersection which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
  • the combination function is a difference function to create a set difference between the plurality of streams
  • the sample cardinality is created for a set difference between the second data structure of the plurality of samples
  • the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
  • the MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set difference which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
  • Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
  • a data processor such as a computing platform for executing a plurality of instructions.
  • the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data.
  • a network connection is provided as well.
  • a display and/or a user input device such as a keyboard or mouse are optionally provided as well.
  • FIG. 1 is a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
  • FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
  • FIG. 3 is a schematic illustration of a sampled stream space.
  • DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION
  • the present invention in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
  • a cardinality of a single stream and/or a set expression in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) using a significantly small sample of each of the streams.
  • Each of the streams comprises a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an internet protocol (IP) packet and/or the like and the sample (sampled stream) of each stream comprises a group of elements randomly sampled from a respective stream.
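As a minimal sketch of how a sample of a stream might be drawn, the following assumes independent (Bernoulli) sampling, in which every element of the stream is kept with the same probability. The text only states that the elements are randomly sampled, so the sampling scheme, the function name `sample_stream` and the parameters `rate` and `seed` are illustrative assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_stream(stream: Iterable[T], rate: float, seed: int = 0) -> List[T]:
    """Keep each stream element independently with probability `rate`.

    Assumption: Bernoulli sampling; the source does not fix a sampling scheme.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [element for element in stream if rng.random() < rate]


if __name__ == "__main__":
    # A toy stream of IP-like strings with many repetitions.
    stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"] * 250
    sample = sample_stream(stream, rate=0.05)
    print(len(sample), "of", len(stream), "elements kept")
```

Repetitions survive sampling, so the sample is itself a stream with repetitions, which is what the later steps operate on.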
  • Estimating the cardinality of streams as well as estimating the cardinality of set expressions between multiple streams may be useful for a plurality of applications ranging from database queries to network traffic monitoring and security applications.
  • computing a precise (exact) cardinality for the streams and moreover for the set expressions of the streams may be complex and costly at best and impractical at worst as often the streams may be extremely large.
  • the cardinality computation may therefore require high computation resources, large storage resources and may further limit real-time computation.
  • Estimating the cardinality of a stream using only a sample (sampled stream) of the stream which comprises elements randomly selected from the stream is known in the art.
  • such estimation of extremely large streams may also require excessive computation and/or storage resources rendering the estimation impractical.
  • such estimation may not be applicable for the set expressions between multiple large streams.
  • a Maximal-Term with Sample (MTS) methodology presents an MTS sketch used by MTS based algorithms which may be used for accurately estimating the cardinality of the streams as well as the cardinality of the set expression between the plurality of streams using only a significantly small subsample of each of the samples (sampled streams) of the streams.
  • the cardinality of the streams and/or of the set expressions is estimated using an MTS sketch created for each of the samples.
  • Each MTS sketch includes a first data structure and a second data structure.
  • the first data structure comprises a vector of maximal hash values computed for the elements in the respective sample using a plurality of hash functions.
  • the second data structure is a subsample of the respective sample and comprises a fixed-size subset of elements having the minimal maximal hash values among the elements of the respective sample.
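The two data structures described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `unit_hash` (seeded BLAKE2b mapped to the unit interval), the parameter names `num_hashes` and `subsample_size`, the use of one extra hash function to rank elements for the subsample, and the choice to store each selected element together with its number of appearances in the sample are all hypothetical and not taken from the text.

```python
import hashlib
import heapq
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def unit_hash(element: Hashable, seed: int) -> float:
    """Hypothetical helper: hash an element to the unit interval, one function per seed."""
    digest = hashlib.blake2b(
        repr(element).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def build_mts_sketch(
    sample: Iterable[Hashable], num_hashes: int = 64, subsample_size: int = 16
) -> Tuple[List[float], Dict[Hashable, int]]:
    """Return (vector of maximal hash values, fixed-size subsample with counts)."""
    counts = Counter(sample)  # appearances of each distinct element in the sample
    # First data structure: one maximal hash value per hash function.
    max_hashes = [
        max((unit_hash(e, j) for e in counts), default=0.0)
        for j in range(num_hashes)
    ]
    # Second data structure (assumption): the elements with the smallest value
    # under an extra hash function, kept together with their counts.
    chosen = heapq.nsmallest(
        subsample_size, counts, key=lambda e: unit_hash(e, num_hashes)
    )
    return max_hashes, {e: counts[e] for e in chosen}


if __name__ == "__main__":
    max_hashes, subsample = build_mts_sketch(["a", "b", "a", "c"], 8, 2)
    print(max_hashes[:3], subsample)
```

Because repeated elements hash to the same values, repetitions in the sample do not change the vector of maximal hash values; they only affect the stored counts.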
  • an estimated sample cardinality is first computed for the first data structure, i.e. the maximal hash values of elements in the sample, using one or more max-sketch cardinality estimation techniques, as known in the art, for example, the HyperLogLog algorithm and/or the like.
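To illustrate how a vector of maximal hash values can yield a cardinality estimate, the sketch below uses a simple estimator rather than HyperLogLog. It assumes each hash function maps elements uniformly and independently to the interval (0, 1); under that assumption, the negative logarithm of the maximum over n distinct elements is exponentially distributed with rate n, so with m hash functions the value (m - 1) / sum(-ln h_j) is an unbiased estimate of n. The function name `estimate_cardinality` is hypothetical.

```python
import math
import random
from typing import Sequence


def estimate_cardinality(max_hashes: Sequence[float]) -> float:
    """Estimate the number of distinct elements from per-function maximal hashes.

    Assumption: hash values are uniform on (0, 1). This is a swapped-in simple
    estimator for illustration, not HyperLogLog.
    """
    m = len(max_hashes)
    if m < 2:
        raise ValueError("need at least two hash functions")
    if any(not 0.0 < h < 1.0 for h in max_hashes):
        raise ValueError("maximal hash values must lie strictly between 0 and 1")
    total = sum(-math.log(h) for h in max_hashes)
    return (m - 1) / total


if __name__ == "__main__":
    # Simulate 256 hash functions applied to 500 distinct elements.
    rng = random.Random(1)
    maxima = [max(rng.random() for _ in range(500)) for _ in range(256)]
    print(round(estimate_cardinality(maxima)))
```

The relative standard error of this estimator shrinks roughly as one over the square root of the number of hash functions, which is why the sketch keeps a vector rather than a single maximum.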
  • the second data structure, i.e. the fixed-size subset of the sample of the stream, and applying one or more frequency estimation techniques as known in the art, for example, Good-Turing frequency estimation.
  • a ratio value is computed which estimates the proportion between cardinality of the elements appearing only once in the sampled stream (sample) and the cardinality of the elements appearing only once in the full (un-sampled) stream.
  • the MTS methodology may efficiently extend the cardinality estimation to estimate the cardinality of the set expressions between the plurality of streams, i.e. multiple streams.
  • the estimated cardinality may be computed for the set union, which may be regarded as a single concatenated stream created by concatenating the plurality of streams. The same technique applied for the single stream may then apply for the concatenated stream.
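Since the maximum over a concatenation of two samples equals the larger of the two per-sample maxima, a sketch of the union can be formed from the two per-sample sketches without revisiting the samples. The following is a minimal sketch under assumptions: each sketch is a pair of a maximal-hash vector and a dictionary mapping subsample elements to counts, and the ranking function `rank` used to pick the fixed-size subset is a hypothetical helper that must be the same one used when the per-sample subsamples were built.

```python
import hashlib
import heapq
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Sketch = Tuple[List[float], Dict[Hashable, int]]


def rank(element: Hashable) -> float:
    """Hypothetical ranking hash used to choose the fixed-size subset."""
    digest = hashlib.blake2b(repr(element).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") / 2**64


def merge_union(sketch_a: Sketch, sketch_b: Sketch, subsample_size: int) -> Sketch:
    """Combine two sketches into a sketch of the concatenated samples."""
    max_a, sub_a = sketch_a
    max_b, sub_b = sketch_b
    if len(max_a) != len(max_b):
        raise ValueError("sketches must use the same number of hash functions")
    # First data structure: element-wise maximum of the two vectors.
    merged_max = [max(x, y) for x, y in zip(max_a, max_b)]
    # Second data structure: pool both subsamples, add up the counts, and
    # keep the fixed-size subset with the smallest rank.
    counts = Counter(sub_a) + Counter(sub_b)
    keep = heapq.nsmallest(subsample_size, counts, key=rank)
    return merged_max, {e: counts[e] for e in keep}


if __name__ == "__main__":
    a = ([0.2, 0.9], {"a": 1})
    b = ([0.5, 0.3], {"a": 2, "b": 1})
    print(merge_union(a, b, subsample_size=5))
```

The merged sketch can then be passed to the same single-stream estimation steps as any other sketch.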
  • the MTS methodology further extends the cardinality estimation for the other set expression, in particular, the set intersection between the plurality of streams and the set difference between the plurality of streams.
  • the estimated cardinality of the set intersection and/or the set difference may be derived from the cardinality estimation of the set union using set theory conventions defining relations between the various set expressions, in particular, the Jaccard similarity statistics (also known as intersection over a set union and/or the Jaccard similarity coefficient) which are known in the art.
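The set relations referred to here can be written out directly. With the Jaccard similarity J defined as the intersection size over the union size, the intersection cardinality equals J times the union cardinality, and the difference cardinality equals the cardinality of the first set minus that intersection. The sketch below checks these identities on small exact sets; in an estimation setting the inputs would be estimates instead. All function names are hypothetical.

```python
from typing import Set


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def intersection_cardinality(j: float, union_card: float) -> float:
    """Intersection size from the Jaccard similarity and the union size."""
    return j * union_card


def difference_cardinality(card_a: float, j: float, union_card: float) -> float:
    """Size of A minus B: the size of A less the size of the intersection."""
    return card_a - j * union_card


if __name__ == "__main__":
    a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
    j = jaccard(a, b)
    union_card = len(a | b)
    print(intersection_cardinality(j, union_card))
    print(difference_cardinality(len(a), j, union_card))
```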
  • the MTS sketch and algorithms may be used to estimate the cardinality of any sequence of set expressions between any number of streams using a small sample of each of the streams.
  • the Jaccard similarity may be computed for the plurality of streams and/or for the set expression, in particular, the set intersection and the set difference using the MTS sketch, i.e. the first data structure and the second data structure created
  • the MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams compared to existing methods, techniques and/or algorithms for computing and/or estimating the cardinality.
  • Some of the existing methods may compute a precise cardinality for the stream by processing the entire un-sampled stream, i.e. analyzing each element in the stream. Such cardinality computation may require extremely high computation resources, storage resources and/or time, thus rendering the cardinality computation inefficient, costly and typically impractical for extremely large streams.
  • Other existing methods may apply one or more algorithms to compute an estimator for computing the cardinality of a sample of the stream, i.e. a sampled stream in order to estimate the cardinality of the stream.
  • the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.
  • the MTS algorithms may be easily and efficiently extended for estimating the cardinality of set expressions of the streams. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
  • the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time.
  • the accuracy of the estimation is maintained as presented herein after.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 illustrates a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
  • An exemplary process 100 may be executed to estimate a cardinality of a stream (set) and/or of a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
  • the process 100 is applied to estimate the cardinality of the set expression using only a significantly small sample of each of the streams where each sample (sampled stream) comprises a group of elements randomly sampled from a respective stream.
  • the process 100 estimates the cardinality of the single stream and/or of the set expressions using an MTS sketch created for each of the samples where each of the MTS sketches includes a first data structure and a second data structure (subsample) computed for each of the samples.
  • the process 100 computes an estimated sample cardinality for a single stream and/or for set expression(s) of the samples using the first data structure(s) created for the samples by estimating the cardinality of the elements appearing once in the sample(s).
  • the estimated cardinality of the sample and/or set expression(s) of the samples may be computed using one or more cardinality estimation tools as known in the art, for example, HyperLogLog algorithm and/or the like.
  • the estimated sample cardinality is then applied with a computed ratio value which estimates the ratio (proportion) between the cardinality of the elements appearing only once in the sample and the cardinality of the elements appearing only once in the full stream.
  • the ratio value is computed using the second data structure(s) and
  • FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
  • An exemplary system 200 for executing a process such as the process 100 to estimate a cardinality of set expressions between streams (sets) comprises a computing node 201 for example, a computer, a server, a cluster of computing nodes and/or any device having one or more processors.
  • the computing node 201 may typically include an input/output (I/O) interface 202 for obtaining a plurality of samples 220 of the plurality of streams, a processor(s) 204 and a storage 206.
  • the I/O interface 202 may provide one or more interconnect interfaces, for example, a network interface, a local interface and/or the like.
  • the network interface may support one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless LAN (WLAN) (e.g. Wi-Fi), a cellular network and/or the like.
  • the local interface may include one or more interfaces, for example, a Universal Serial Bus (USB) interface, a memory management controller (MMC) interface, a serial interface and/or the like for connecting to one or more peripheral devices, for example a storage device and/or the like.
  • the processor(s) 204 may be arranged for parallel processing, as clusters and/or as one or more multi core processor(s).
  • the storage 206 may include one or more computer readable medium devices, either persistent storage and/or volatile memory for one or more purposes, for example, storing program code, storing data, storing intermediate computation products and/or the like.
  • the persistent storage may include one or more persistent memory devices, for example, a Flash array, a Solid State Disk (SSD) and/or the like for storing program code.
  • the volatile memory may also include one or more volatile memory devices, for example, a Random Access Memory (RAM) device.
  • the storage 206 may further include one or more networked storage resources, for example, a storage server, a Network Attached Storage (NAS) and/or the like accessible through the I/O interface 202.
  • the processor(s) 204 may execute one or more software modules, for example, a process, an application, an agent, a utility, a script, a plug-in and/or the like.
  • a software module may comprise a plurality of program instructions stored in a non-transitory medium such as the program store 206 and executed by a processor such as the processor(s) 204.
  • the processor(s) 204 may execute, for example, a cardinality estimator 210 for estimating the cardinality of the set expression, in particular a set union, a set intersection and a set difference between a plurality of streams each comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
  • the cardinality estimator 210 may estimate the cardinality of the set expression using only a significantly small sample 220 of each of the streams 220 obtained through the I/O interface 202 and/or from the storage 206.
  • the cardinality estimator 210 is executed by one or more virtual machines (VM) hosted by a computing node such as the computing node 201.
  • the cardinality estimator 210 is utilized as one or more remote services, for example, a remote server service, a cloud service, a Software as a Service (SaaS), a Platform as a Service (PaaS) and/or the like which are accessible over one or more networks from the computing node 201.
  • the process 100 starts with the cardinality estimator 210 receiving a query for estimating a cardinality of a stream and/or of a set expression, in particular, a set union, a set intersection and a set difference between the plurality of streams each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
  • the cardinality estimator 210 obtains the sample 220 of the stream in case of the single stream and/or the samples 220 of the plurality of streams in case of the set expressions where each sample (sampled stream) 220 comprises a group of elements randomly sampled from the respective stream.
  • the cardinality estimator 210 may obtain one or more of the samples 220 from one or more remote location, for example, a server, a cloud service, a cloud storage and/or the like which are accessible from the computing node 201 over one or more networks through the I/O interface 202.
  • the cardinality estimator 210 may also obtain one or more of the samples 220 from the storage 206, either from a local storage and/or from a remote storage resource accessible through the I/O interface 202.
  • the cardinality estimator 210 may obtain the sample(s) 220 from a local hard drive.
  • the cardinality estimator 210 may obtain the sample(s) 220 from a NAS and/or the like.
  • the cardinality estimator 210 may obtain the sample(s) 220 from an attachable storage drive and/or the like.
  • the cardinality estimator 210 computes a first data structure and a second data structure for each of the samples 220. The computation of the first data structure(s) and the second data structure(s)
  • the cardinality estimator 210 computes:
  • cardinality estimator 210 may apply one or more cardinality estimation
  • the cardinality estimator 210 may extend the cardinality estimation techniques applied to the single stream to compute the estimated sample cardinality value of a set union of the samples 220 which may be regarded as a concatenation of the samples 220.
  • the cardinality estimator 210 may apply conventions of set theory including, for example, the Jaccard similarity for further extending the cardinality estimation for other set expressions, for example, the set intersection and/or the set difference.
  • the cardinality estimator 210 reduces the ratio value computation to estimation of cardinality of elements appearing only once in the second data structure
  • the cardinality estimator 210 may apply one or
  • Good-Turing frequency estimation technique to compute the ratio between the estimated sample cardinality value and the estimated cardinality value of the entire stream(s).
  • the computation of the estimated sample cardinality value and the computation of the ratio value are described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams.
  • Algorithm 1 for the single stream
  • Algorithm 2 for the set union between two streams
  • Algorithm 3 for the set intersection between two streams
  • Algorithm 4 for the set difference between two streams.
  • the cardinality estimator 210 applies the computed ratio value to, for example multiplies it by, the estimated cardinality computed for the sample 220 (single stream) and/or for the set expression between the samples 220 (multiple streams) to compute an estimated cardinality for the entire stream and/or for the set expression between the entire streams (multiple streams).
  • the computation of the estimated cardinality for the set expression between the streams is described in detail hereinafter, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams.
  • Algorithm 5 for the single stream
  • Algorithm 2 for the set union between two streams
  • Algorithm 3 for the set intersection between two streams
  • Algorithm 4 for the set difference between two streams.
  • the Good-Turing frequency estimation technique is useful in many language-related tasks where the problem is to determine the probability that a word appears in a document. Let be a stream of elements possibly with repetitions, and be the set of all different elements, such that Suppose that we want to estimate the probability that a randomly chosen element from the stream 0
  • a naive approach is to choose a sample of elements from the stream 0, and then to set where denotes the number of appearances
  • the hidden mass i.e. the estimator for the hidden
  • elements may be estimated using the relative frequency of the elements that appear exactly once in the sample For example, if 1/10 of the elements in the sample appear only once in the sample then approximately 1/10 of the elements in are unseen elements, namely, they do not appear at all in the sample
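The Good-Turing idea described above can be sketched in a few lines. This is a minimal illustration, not the patent's procedure: the function name `unseen_mass` and the toy sample are hypothetical, and the estimate is computed over the full sample rather than over a fixed-size subsample.

```python
from collections import Counter


def unseen_mass(sample):
    """Good-Turing estimate of the probability that the next element is unseen.

    It is the number of elements that appear exactly once in the sample,
    divided by the sample length.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Toy sample: "c" and "d" each appear exactly once among 10 elements.
sample = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "a"]
print(unseen_mass(sample))
```

Here two of the ten sampled elements are singletons, so roughly one fifth of the stream's mass is estimated to belong to elements that the sample missed.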
  • the Jaccard similarity value ranges between 0, when the two streams are completely different, and 1, when the two streams are identical.
  • An efficient and accurate estimate of the Jaccard similarity is known in the art and may be computed as follows. First, each element in the two streams is hashed into (0, 1). Then, the maximal hash value of each stream is taken as a sketch that represents the whole stream. As demonstrated in the art, the probability that the sketches of the two streams are equal is exactly the Jaccard similarity of the streams. When only one hash function is used, the variance of the estimate may be infinite. Thus, multiple hash functions may be used, and the sketch representing each of the streams is actually a vector of maximal values, one per hash function. As demonstrated in the art, improved performance may be attained if, instead of multiple hash functions, only two hash functions with stochastic averaging are used.
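The max-hash estimate of the Jaccard similarity can be illustrated as follows. This is a sketch under stated assumptions: the seeded BLAKE2b hash, the helper names `unit_hash`, `max_hash_sketch` and `jaccard_estimate`, and the choice of 256 hash functions are all hypothetical, and the stochastic-averaging variant mentioned above is not shown.

```python
import hashlib


def unit_hash(element, seed):
    """Map an element to a float in [0, 1) with a seeded BLAKE2b digest."""
    digest = hashlib.blake2b(
        repr(element).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def max_hash_sketch(stream, num_hashes):
    """One maximal hash value per hash function; repetitions do not matter."""
    distinct = set(stream)
    return [max(unit_hash(x, seed) for x in distinct) for seed in range(num_hashes)]


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of hash functions whose maximal values agree in both sketches."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


stream_a = list(range(0, 1000))
stream_b = list(range(500, 1500))  # true Jaccard similarity is 500 / 1500
estimate = jaccard_estimate(max_hash_sketch(stream_a, 256), max_hash_sketch(stream_b, 256))
print(estimate)
```

Each hash function contributes one Bernoulli trial whose success probability equals the Jaccard similarity, so the accuracy improves as more hash functions are used.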
  • Equation 1 below may be used to estimate the Jaccard similarity of the two streams.
  • the Jaccard similarity may be generalized to set difference as expressed in Equation 2 below.
  • the estimator presented in Equation 1 may be generalized as expressed in Equation 3 below.
  • the estimation may be performed for a set difference such as .
  • the notations and are used herein after to indicate the Jaccard similarity variables , and respectively.
  • the MTS methodology may be used to accurately estimate cardinality for set expressions of a plurality of streams using only a small sample of each of the streams.
  • the set expressions for example, a set union, a set intersection, a set difference and/or the like are created by applying one or more combination functions, for example, a union, an intersection and a difference respectively to the plurality of streams.
  • the MTS methodology and the algorithms utilizing the MTS sketch are first presented for estimating the cardinality of set expressions of two streams and are extended to set expressions of the plurality of streams hereinafter.
  • Estimating the cardinality for a single stream using a generic scheme that combines a sampling process with a cardinality estimation procedure of a single stream as known in the art may consist of two steps: (a) using one or more cardinality estimators as known in the art for estimating cardinality of a sampled stream comprising samples of the original stream; and (b) estimating a sampling ratio, namely, the factor by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full original stream.
  • Such estimation is typically based on storing a small fixed-size subsample of the sampled stream and using it to estimate the probability of unseen elements using the Good-Turing technique.
  • the scheme used for estimating cardinality of the single stream may be generalized to set expressions between multiple streams.
  • the cardinality estimation is based on maintaining an MTS sketch for each of the plurality of streams which comprises a small fixed-size subsample of the sampled stream (i.e. the sample) and using this subsample for estimating the probability of unseen elements.
  • the MTS sketch stores two data structures for each sampled stream (sample a first data structure and a second data
  • Illustration 300 presents a stream comprising a plurality of elements for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
  • a sample 0 is a sampled stream of the stream which comprises E elements randomly sampled from the stream such that Assuming that the sampling rate is 0, the sample includes of the elements of A subsample 0 includes part of the
  • the subsample may include a fixed number of elements of the sample, namely those which have minimal hash values among the elements of the sample.
  • the subsample 0 may be generated using, for example, one-pass reservoir sampling as known in the art. Using the one-pass reservoir sampling implementation, first, the subsample is initialized with the first elements of the sample E, namely, and the
  • the new element is stored in the subsample instead of the element
  • the subsample E stores the E elements whose hash values were minimum, and it can be considered as a uniform subsample of length
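One way to maintain such a fixed-size subsample in a single pass is a bounded max-heap keyed on the hash value. The following is a minimal sketch, assuming a BLAKE2b-based hash into [0, 1) and hypothetical names (`unit_hash`, `min_hash_subsample`, the subsample size `k`); the patent's own procedure may differ in detail, for example in what is stored alongside each element.

```python
import hashlib
import heapq


def unit_hash(element):
    """Map an element to a float in [0, 1) with a BLAKE2b digest."""
    digest = hashlib.blake2b(repr(element).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") / 2**64


def min_hash_subsample(sample, k):
    """Keep the k distinct elements with the smallest hash values, in one pass."""
    heap = []        # max-heap via negated hash values: (-hash, element)
    members = set()  # elements currently held in the subsample
    for element in sample:
        if element in members:
            continue
        h = unit_hash(element)
        if len(heap) < k:
            heapq.heappush(heap, (-h, element))
            members.add(element)
        elif h < -heap[0][0]:
            # The new element replaces the stored element with the largest hash.
            _, evicted = heapq.heapreplace(heap, (-h, element))
            members.discard(evicted)
            members.add(element)
    return sorted(members, key=unit_hash)


sample = [i % 50 for i in range(500)]  # 50 distinct values, each repeated 10 times
subsample = min_hash_subsample(sample, 10)
print(subsample)
```

Because the hash values of distinct elements behave like independent uniform draws, the retained elements can be treated as a uniform subsample of the distinct elements, and the memory used never exceeds `k` entries.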
  • MTS sketch is additive, i.e., the MTS sketch of a set union of a plurality of streams may be computed directly from the MTS sketches of the streams.
  • Corollary 1 summarizes this additivity property for two streams, which may be generalized for any number of streams.
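The additivity property can be checked on a toy example: merging two sketches gives the same result as sketching the concatenation of the two samples. The sketch layout below (a list of maximal hash values plus a list of the `k` minimal-hash elements) and the names `build_sketch` and `merge_sketches` are assumptions made for illustration only.

```python
import hashlib


def unit_hash(element, seed=0):
    """Map an element to a float in [0, 1) with a seeded BLAKE2b digest."""
    digest = hashlib.blake2b(
        repr(element).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def build_sketch(sample, num_hashes, k):
    """Return (maximal hash value per hash function, k minimal-hash elements)."""
    distinct = set(sample)
    max_hashes = [
        max(unit_hash(x, seed) for x in distinct) for seed in range(1, num_hashes + 1)
    ]
    subsample = sorted(distinct, key=unit_hash)[:k]
    return max_hashes, subsample


def merge_sketches(sketch_a, sketch_b, k):
    """Sketch of the union, computed only from the two input sketches."""
    max_a, sub_a = sketch_a
    max_b, sub_b = sketch_b
    merged_max = [max(a, b) for a, b in zip(max_a, max_b)]
    merged_sub = sorted(set(sub_a) | set(sub_b), key=unit_hash)[:k]
    return merged_max, merged_sub


sample_a = list(range(0, 300))
sample_b = list(range(200, 600))
merged = merge_sketches(build_sketch(sample_a, 16, 20), build_sketch(sample_b, 16, 20), 20)
direct = build_sketch(sample_a + sample_b, 16, 20)
print(merged == direct)
```

The merge works because the maximum over a union is the maximum of the per-stream maxima, and the `k` smallest hash values of a union are always among the `k` smallest of each part.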
  • the MTS methodology is first described for a single stream.
  • estimating the cardinality for the single stream may be done by applying the Good-Turing technique to combine the sampling process with the generic cardinality estimation procedure of the single stream.
  • the Good-Turing algorithm may receive the sampled stream (i.e.
  • the Good-Turing algorithm consists of two steps: (a) estimating the cardinality of the sample using any procedure known in the art for estimating the cardinality of a single stream without sampling, designated CAR EST PROC hereinafter; and (b) estimating the ratio factor by which the cardinality of the sample (sampled stream) should be multiplied in order to estimate the cardinality of the full original stream.
  • step (a) CAR EST PROC is invoked using storage units
  • step (b) it is noted that the probability for unseen elements in the stream may be expressed as Therefore, the problem of estimating may be reduced to estimating the probability of unseen elements. According to the Good-Turing technique, is a consistent
  • the storage elements number as well as processing resources may be significantly reduced thus reducing cost, complexity, time and/or the like by reducing the estimation problem to computing an approximation of the value of using the subsample of the sample according to some embodiments of the present
  • This algorithm for estimating the cardinality of a single stream using the MTS sketch may be formulated by algorithm 1 below which utilizes procedure 1 below for estimating
  • algorithm 1 may be extended for estimating the cardinality of a set union of the two stream and Assuming the samples be the samples (sampled streams) of the streams and
  • Algorithm 2 which in turn may use Algorithm 1 for processing the MTS sketch of the concatenation 13 .
  • algorithm 1 and algorithm 2 may be extended for estimating the cardinality of a set intersection of the two streams and .
  • a a where is the Jaccard similarity of the two full streams and .
  • Algorithm 2 may therefore be used for estimating a a while the Jaccard similarity for the streams and needs to be estimated.
  • the Jaccard similarity may be expressed as shown in Equation 4 below.
  • Equation 5 may be formulated according to Good-Turing (refer to Table 1 for the notations).
  • Equation 5 into Equation 4 may produce Equation 6 below.
  • Equation 7
  • Algorithm 3 may be used for estimating the cardinality a a of the set intersection of using the samples In algorithm 3, may be estimated using Procedure 1. Additionally, may also be estimated using Procedure 1 using the . Finally, may be estimated from and
  • algorithm 1 and algorithm 2 may be similarly extended for estimating the cardinality of a set difference of the two streams and .
  • a a where 0 according to Equation 2.
  • Algorithm 3 may be used for estimating the cardinality of the set difference using the two samples, with the only difference being that the Jaccard similarity variable of the set difference is estimated rather than that of the set intersection.
  • Equation 8
  • Equation 9 may follow.
  • Equation 9 may be rewritten as
  • Algorithm 4 which is an adjustment of Algorithm 3 may be used for estimating the cardinality a a of the set difference of using the samples and In algorithm 4 may be estimated using Procedure 1.
  • the MTS methodology in particular Algorithm 1, Algorithm 2, Algorithm 3 and/or Algorithm 4 may be extended to estimate the cardinality of set expressions between streams, where . Assuming are streams, and are the respective samples, i.e. their respective sampled streams. The samples may be used to estimate the
  • the sample 0 may be expressed as
  • Equation 11:
  • indicator variable is 1 if, for the hash function, satisfy the condition implied by the set expressions, and is 0
  • Algorithm 5 may be used for estimating the cardinality of the set
  • Algorithm 5 consists of three steps: (a) using Equation 11 to estimate ; (b) using CAR EST PROC to estimate and (c) using Procedure 5 to estimate— , the factor (ratio) by which the cardinality 12 of the sampled stream 13 should be multiplied in order to estimate the cardinality 13 of the full stream 13.
  • Algorithm 5 may use Procedure 5 below for estimating
  • the correctness of the MTS methodology in particular, the correctness of Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4 and Algorithm 5 may be verified through an analytical analysis.
  • Lemma 1 is presented to describe how to compute probability distribution of a product of two normally distributed random variables whose covariance is 0.
  • HyperLogLog estimator belongs to a family of sketches and is may present
  • the standard error of the HyperLogLog estimator is represents a number of storage units (e.g. registers) used for the estimation procedure.
  • Pseudo-code of the HyperLogLog procedure is presented in Algorithm 6 below. Algorithm 6:
  • Lemma 2 summarizes the statistical performance of Algorithm 6 without sampling, i.e., when the algorithm processes the entire stream.
  • the considered set, 13 is the estimated cardinality computed using Algorithm 6, and is the number of storage units used by Algorithm 6.
  • Algorithm 1 estimates with mean value and variance namely, where
  • Algorithm 2 estimates a with mean value and variance
  • Lemmas i.e. Lemma 3, Lemma 4 and Lemma 5 are used herein after for the analysis of the performance of Algorithm 3 and Algorithm 4 using the MTS sketch.
  • Procedure 3 estimates with mean value and variance namely,
  • Lemma 4 may be proved as follows:
  • Procedure 3 estimates j) eno ⁇ e me distnict elements in the union subsample as For each the probability that
  • Lemma 5 may be proved as follows:
  • Equation 12 follows from covariance properties.
  • Equation 13 As shown in Procedure 3, may be written as expressed in Equation 13 below.
  • Equation 14 the covariance may be expressed as shown in Equation 14 below.
  • Theorem 2 below states the asymptotic statistical performance of Algorithm 3 used for estimating the cardinality of the set intersection as described herein above.
  • Algorithm 3 estimates a with mean value and variance namely,
  • Theorem 2 may be proved as follows: may be denoted Similarly may be denoted with the respective expression.
  • the estimator in Algorithm 3 as expressed in Equation 7 may be rewritten as follows:
  • the asymptotic distribution of may be first analyzed. Recall that according to
  • the variance may be expressed as
  • Equation 17
  • Theorem 3 below states the asymptotic statistical performance of Algorithm 4 used for estimating the cardinality of the set difference as described herein above.
  • Algorithm 4 estimates with mean value and variance namely,
  • Lemma 6 below is used for the analysis.
  • Equation 1 the estimation of is normally distributed with mean and variance
  • Lemma 6 may be proved for where the proof for and is similar. As known in the art, for the hash function the following applies:
  • Equation and Equation 18 follows that is a sum of 3 Bernoulli variables. Therefore, it is binomially distributed, and can be asymptotically approximated to normal distribution as , namely,
  • Algorithm 5 is used for estimating the cardinality of set expressions, in particular a set union, a set intersection and a set difference, between multiple streams as described herein above.
  • Theorem 4 below states the asymptotic statistical performance of Algorithm 5.
  • Algorithm 5 estimates with mean value and variance— namely
  • Theorem 4 may be proved as follows:
  • Simulation tests were conducted to validate the MTS methodology, in particular the asymptotic bias and variance of MTS Algorithms 1, 2, 3, 4 and 5 as analyzed in the Theorems presented herein above. More specifically, the simulations were conducted to demonstrate the following:
  • Algorithms 3 and 4 are unbiased, as proven by Theorems 2 and 3.
  • The variance of Algorithm 5 is close to its analyzed variance in Theorem 4.
  • Uniform distribution: the frequency of the elements is uniformly distributed between 100 and 1,000.
  • the Pareto distribution has several unique properties. In particular, if the shape parameter α ≤ 2, the Pareto distribution has infinite variance, and if α ≤ 1, the Pareto distribution has infinite mean. As α decreases, a larger portion of the probability mass is in the tail of the distribution, and the Pareto distribution is therefore useful when a small percentage of the population controls the majority of the measured quantity.
  • Each of the simulation tests was repeated for 1,000 different streams (sets). Thus, for each of the simulated MTS Algorithms and for each tested parameter value, a vector of 1,000 different estimations was produced.
  • the variance and bias of this vector were computed and the results as presented herein after are considered as the variance and bias of the respective Algorithm for a specific value of .
  • Each such computation is represented by one table row in Table 2, Table 3 and Table 4 below.
  • the vector of estimations for a specific Algorithm and for a specific value of may be expressed as A mean of the vector may be expressed as 0
  • the bias and variance of 0 are computed as follows:
  • the sampling ratio is In each table row we present the bias.
  • the measured bias values are very low and tend to 0 in practice, in agreement with the analysis of the bias of Algorithms 3 and 4.
  • the expected length of each original stream is .
  • a total storage budget of storage units per stream which is about 0.006% of the stream length, yields accurate estimation for both set intersection (Alg. 3) and set difference (Alg. 4) cardinalities.
  • the expected length of each original stream is 500 × 10^6.
  • Using a total storage budget of storage units, namely, of the stream length yields significantly accurate estimations for both set intersection and set difference cardinalities.
  • Table 3 and Table 4 below present simulation tests results for both Algorithms 3 and 4 for different values of using uniform and Pareto frequency distributions.
  • the sampling ratio is A and two values of are used, .
  • the results are averaged over 1,000 runs of the simulation tests and the "analysis" variance is determined according to Theorems 2 and 3.
  • the simulation tests aim to confirm Theorem 4 presented to analyze and theoretically verify Algorithm 5.
  • the simulation tests for Algorithms 5 were conducted over three streams (sets), , and , each with distinct elements and uniformly distributed frequencies as described herein above for the simulation of Algorithms 3 and 4.
  • the simulation tests were conducted to estimate the cardinality of a set expression a a which, as known in the art, may be expressed
  • Table 5 presents the simulation tests results for different intersection values using uniform frequency distributions for buckets and .
  • the sampling ratio is .
  • the results are averaged over 1,000 runs of the simulation tests, and the "analysis" variance is determined according to Theorem 4.
  • the MTS methodology may be applied to a plurality of applications in a wide variety of domains.
  • the MTS methodology may be used for Query optimization which may be required by database systems to determine a best (low-cost) plan for processing queries.
  • the query optimization may be processed by a query optimizer which estimates the cost of a plan according to the input/output cardinalities of each plan's operator.
  • Accurate cardinality estimation of set expressions over table fields in one scan using fixed memory as done by the MTS based cardinality estimation for the set expressions may be significantly valuable for such query optimizers. For example, assuming three large relational databases, with a shared field In case of processing a query such as , where 13 is a stream of tuples in the field The database system may need to determine the best (low-cost) plan for processing this query.
  • the query optimizer may efficiently estimate the cardinality of the set expression
  • the database system may select the best join strategy.
  • the system may estimate the size of the outcome join.
  • the MTS methodology may be used for network monitoring and security. Network management may require continuous measurement of multiple network parameters whose values may be efficiently estimated using the MTS sketch, using only a small portion of the monitored data.
  • real-time detection of anomalies may be feasible. For example, assuming a network where ⁇ packets are received from different
  • Two examples of monitoring applications may be as follows:
  • cardinality estimation procedure and sampling technique are intended to include all such new technologies a priori.
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Abstract

A computer implemented method of estimating a cardinality of a stream, comprising: receiving a query for estimating a cardinality of a stream comprising a plurality of elements, obtaining a sample comprising a group of the plurality of elements randomly sampled from the stream, computing first and second data structures for the sample, which are used to compute an estimated sample cardinality of the sample and a ratio indicative of a proportion between the estimated sample cardinality and the estimated cardinality of the stream, and computing the estimated cardinality of the stream by applying the ratio to the estimated sample cardinality. The first data structure comprises a plurality of maximal hash values computed for the sample using a plurality of hash functions, and the second data structure comprises a fixed-size subset of the elements having a minimal hash value among the elements of the group.

Description

MTS SKETCH FOR ACCURATE ESTIMATION OF SET-EXPRESSION CARDINALITIES FROM SMALL SAMPLES
FIELD AND BACKGROUND OF THE INVENTION
The present invention, in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
With the evolution of information technology, the amount of data that is processed and/or transferred is constantly growing, presenting major challenges to the many applications that need to process extremely large volumes of data, in many cases in real time.
Therefore, various methods, techniques, frameworks and/or the like are continually developed to enable such applications to process the increasing data volumes.
One or more of such data processing methodologies may include identifying the cardinality, i.e. the number of distinct elements, in streams and/or sets comprising a plurality of elements with repetitions. The cardinality may be of major interest for multiple applications ranging from database queries to network traffic monitoring and network security applications.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention there is provided a computer implemented method of estimating a cardinality of a stream, comprising using one or more processors configured to execute a code, the code is adapted for:
Receiving a query for estimating a cardinality of a stream comprising a plurality of elements.
Obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream.
Computing a first data structure and a second data structure for the sample. The first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
Computing, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream.
Computing the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
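The steps of the first aspect can be sketched end to end on synthetic data. This is an illustration under loudly stated assumptions, not the claimed implementation: the ratio is taken to be 1 / (1 - P_unseen), with P_unseen the Good-Turing unseen mass; an exact distinct count stands in for a sketch-based estimate of the sample cardinality; and the unseen mass is computed over the whole sample rather than over a fixed-size subsample. The function name `estimate_stream_cardinality` and the toy stream are hypothetical.

```python
import random
from collections import Counter


def estimate_stream_cardinality(sample):
    """Two-step estimate: sample cardinality times a Good-Turing based ratio."""
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # Step (a): cardinality of the sample. An exact count is used here as a
    # stand-in for an estimator that reads the first data structure.
    sample_cardinality = len(counts)
    # Step (b): Good-Turing unseen mass, i.e. the share of sampled elements
    # that appear exactly once (assumed here to use the full sample).
    singletons = sum(1 for c in counts.values() if c == 1)
    unseen = singletons / len(sample)
    if unseen >= 1.0:
        raise ValueError("every sampled element is a singleton; ratio undefined")
    ratio = 1.0 / (1.0 - unseen)  # assumed form of the ratio value
    return sample_cardinality * ratio


# Synthetic stream: 2,000 distinct elements, each repeated 10 times,
# sampled independently at a rate of 10%.
rng = random.Random(7)
stream = [e for e in range(2000) for _ in range(10)]
sample = [x for x in stream if rng.random() < 0.1]
estimate = estimate_stream_cardinality(sample)
print(round(estimate))
```

On this toy stream the sample contains only about two thirds of the distinct elements, and the ratio scales the sample cardinality back toward the true value of 2,000; the result is an approximation, not an exact recovery.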
The MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams using only a significantly small data portion of the stream(s). By accurately estimating the cardinality for the subsample of the sampled stream (sample) of the stream as done by the MTS algorithm, the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small. Moreover, by reducing the cardinality estimation problem for estimating the cardinality of the sample to estimating the cardinality of elements appearing only once in the sample the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time while maintaining high accuracy of the estimated cardinality.
According to a second aspect of the present invention there is provided a system for estimating a cardinality of a stream, comprising one or more processors adapted to execute code, the code comprising:
Code instructions to receive a query for estimating a cardinality of a stream comprising a plurality of elements.
Code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream.
Code instructions to compute a first data structure and a second data structure for the sample. The first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
Code instructions to compute, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream.
Code instructions to compute the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
According to a third aspect of the present invention there is provided a computer implemented method of estimating a cardinality of set expressions between streams, comprising using one or more processors configured to execute a code, the code is adapted for:
Receiving a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements.
Obtaining a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream.
Computing a first data structure and a second data structure for each of the plurality of samples. The first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
Computing, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.
Computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
Since the MTS sketch is additive in nature, the MTS algorithms used for estimating the cardinality of a single stream may be easily and efficiently extended for estimating the cardinality of set expressions of the streams, in particular, a set union, a set intersection and a set difference. This may allow computing an estimated cardinality even for set expressions of multiple extremely large streams and/or sets of elements.
According to a fourth aspect of the present invention there is provided a system for estimating a cardinality of set expressions between streams, comprising one or more processors adapted to execute code, the code comprising:
Code instructions to receive a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements.
Code instructions to obtain a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream.
Code instructions to compute a first data structure and a second data structure for each of the plurality of samples. The first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
Code instructions to compute, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.
Code instructions to compute the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
In a further implementation form of the first, second, third and/or fourth aspects, each of the plurality of elements includes one or more members of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet. The MTS sketch based cardinality estimation may be applied to estimate the cardinality of a diverse range of streams used by multiple applications which may be of very different nature. In particular, the type of the elements of the stream(s) may vary while the same concepts of the MTS sketch based cardinality estimation may apply.
In a further implementation form of the third and/or fourth aspects, the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set union which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
In a further implementation form of the third and/or fourth aspects, the combination function is an intersection function to create a set intersection between the plurality of streams, the sample cardinality is created for a set intersection between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set intersection which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
In a further implementation form of the third and/or fourth aspects, the combination function is a difference function to create a set difference between the plurality of streams, the sample cardinality is created for a set difference between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set difference which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
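By way of a non-limiting illustration, the relation used by the intersection and difference forms above, namely a Jaccard-type ratio applied to the cardinality of the set union, may be checked on small exact sets. The Python sketch below uses exact set operations rather than estimates, and the set contents and function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of two non-empty sets."""
    return len(a & b) / len(a | b)


def difference_ratio(a: set, b: set) -> float:
    """Exact share of the union that belongs to A but not to B."""
    return len(a - b) / len(a | b)


a = set(range(0, 60))      # hypothetical set A
b = set(range(40, 100))    # hypothetical set B
union_cardinality = len(a | b)

# Applying each ratio to the union cardinality recovers the
# cardinality of the intersection and of the difference.
intersection_cardinality = jaccard(a, b) * union_cardinality
difference_cardinality = difference_ratio(a, b) * union_cardinality
```

In an estimation setting both the ratio and the union cardinality would be estimates, so the products would be approximations rather than exact values.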
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention;
FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention; and
FIG. 3 is a schematic illustration of a sampled stream space.
DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION
The present invention, in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
According to some embodiments of the present invention, there are provided methods, systems and computer program products for estimating a cardinality of a single stream and/or a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) using a significantly small sample of each of the streams. Each of the streams comprises a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an internet protocol (IP) packet and/or the like and the sample (sampled stream) of each stream comprises a group of elements randomly sampled from a respective stream.
Estimating the cardinality of streams as well as estimating the cardinality of set expressions between multiple streams may be useful for a plurality of applications ranging from data base queries to network traffic monitoring and security applications. However, computing a precise (exact) cardinality for the streams and moreover for the set expressions of the streams may be complex and costly at best and impractical at worst as often the streams may be extremely large. The cardinality computation may therefore require high computation resources, large storage resources and may further limit real-time computation. Estimating the cardinality of a stream using only a sample (sampled stream) of the stream which comprises elements randomly selected from the stream is known in the art. However, such estimation of extremely large streams may also require excessive computation and/or storage resources rendering the estimation impractical. Moreover, such estimation may not be applicable for the set expressions between multiple large streams.
According to some embodiments of the present invention, a Maximal-Term with Sample (MTS) methodology presents an MTS sketch used by MTS based algorithms which may be used for accurately estimating the cardinality of the streams as well as the cardinality of the set expression between the plurality of streams using only a significantly small subsample of each of the samples (sampled streams) of the streams.
The cardinality of the streams and/or of the set expressions is estimated using an MTS sketch created for each of the samples. Each MTS sketch includes a first data structure and a second data structure. The first data structure comprises a vector of maximal hash values computed for the elements in the respective sample using a plurality of hash functions. The second data structure is a subsample of the respective sample and comprises a fixed-size subset of elements having the minimal maximal hash values among the elements of the respective sample.
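By way of a non-limiting illustration, the first data structure (the vector of maximal hash values) may be sketched in Python as follows. The salted SHA-256 hash family, the number of hash functions and the function names are assumptions made for this sketch only; the second data structure is not shown.

```python
import hashlib


def hash01(element: str, seed: int) -> float:
    """Map (element, seed) to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    # Keep the top 53 bits so the result is an exact double below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def max_hash_vector(sample, num_hashes: int = 16):
    """Return one maximal hash value per hash function for a non-empty sample."""
    return [max(hash01(e, seed) for e in sample) for seed in range(num_hashes)]


sample = ["a", "b", "a", "c", "b", "a"]   # hypothetical sample with repetitions
vector = max_hash_vector(sample)
```

Because a maximum ignores repetitions, the vector depends only on the distinct elements of the sample and not on their order or multiplicity.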
For a single stream, an estimated sample cardinality is first computed for the first data structure, i.e. the maximal hash values of elements in the sample, using one or more max-sketch cardinality estimation techniques, as known in the art, for example, the HyperLogLog algorithm and/or the like. Using the second data structure, i.e. the fixed-size subset of the sample of the stream, and applying one or more frequency estimation techniques as known in the art, for example, Good-Turing frequency estimation, a ratio value is computed which estimates the proportion between the cardinality of the elements appearing only once in the sampled stream (sample) and the cardinality of the elements appearing only once in the full (un-sampled) stream.
As the MTS sketch is additive, the MTS methodology may efficiently extend the cardinality estimation to estimate the cardinality of the set expressions between the plurality of streams, i.e. multiple streams. First, the estimated cardinality may be computed for the set union, which may be regarded as a single concatenated stream created by concatenating the plurality of streams. The same technique applied for the single stream may then apply for the concatenated stream. The MTS methodology further extends the cardinality estimation to the other set expressions, in particular, the set intersection between the plurality of streams and the set difference between the plurality of streams. The estimated cardinality of the set intersection and/or the set difference may be derived from the cardinality estimation of the set union using set theory conventions defining relations between the various set expressions, in particular, the Jaccard similarity statistics (also known as intersection over a set union and/or the Jaccard similarity coefficient) which are known in the art.
In general the MTS sketch and algorithms may be used to estimate the cardinality of any sequence of set expressions between any number of streams using a small sample of each of the streams.
The Jaccard similarity may be computed for the plurality of streams and/or for the set expression, in particular, the set intersection and the set difference, using the MTS sketch, i.e. the first data structure and the second data structure created for the samples and/or the set expression between the samples.
The MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams compared to existing methods, techniques and/or algorithms for computing and/or estimating the cardinality. Some of the existing methods may compute a precise cardinality for the stream by processing the entire un-sampled stream, i.e. analyzing each element in the stream. Such cardinality computation may require extremely high computation resources, storage resources and/or time, thus rendering the cardinality computation inefficient, costly and typically impractical for extremely large streams. Other existing methods may apply one or more algorithms to compute an estimator for computing the cardinality of a sample of the stream, i.e. a sampled stream, in order to estimate the cardinality of the stream. However, such algorithms may be sensitive to the order of the elements and/or to the repetition pattern of the elements. Moreover, in case of extremely large streams, in particular streams that need to be processed in real-time, the samples themselves may be significantly large thus requiring extensive computation and/or storage resources. Such algorithms may therefore not be suitable for real world applications in which large streams need to be processed in real time.
By accurately estimating the cardinality for the subsample of the sampled streams (samples) as done by the MTS algorithms, the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.
Moreover, as the MTS sketch is additive in nature, the MTS algorithms may be easily and efficiently extended for estimating the cardinality of set expressions of the streams. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
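By way of a non-limiting illustration, the additive nature of the vector of maximal hash values may be sketched in Python as follows: taking the element-wise maximum of two vectors yields the vector that would be computed directly for the concatenation of the two samples. The hash family and all names are assumptions of this sketch.

```python
import hashlib


def hash01(element: str, seed: int) -> float:
    """Map (element, seed) to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def max_hash_vector(sample, num_hashes: int = 16):
    """Return one maximal hash value per hash function for a non-empty sample."""
    return [max(hash01(e, seed) for e in sample) for seed in range(num_hashes)]


def merge(vector_a, vector_b):
    """Element-wise maximum of two vectors built with the same hash functions."""
    return [max(a, b) for a, b in zip(vector_a, vector_b)]


sample_a = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]   # hypothetical samples
sample_b = ["10.0.0.2", "10.0.0.7"]

merged = merge(max_hash_vector(sample_a), max_hash_vector(sample_b))
direct = max_hash_vector(sample_a + sample_b)
```

The two vectors can therefore be computed independently, for example at different monitoring points, and combined later without access to the samples themselves.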
Furthermore, by reducing the cardinality estimation problem for estimating the cardinality of the sample(s) to estimating the cardinality of elements appearing only once in the sample and/or in the set expressions between the samples, the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time. However, while the cardinality estimation is significantly simplified, the accuracy of the estimation is maintained as presented herein after.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Referring now to the drawings, FIG. 1 illustrates a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention. An exemplary process 100 may be executed to estimate a cardinality of a stream (set) and/or of a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
The process 100 is applied to estimate the cardinality of the set expression using only a significantly small sample of each of the streams where each sample (sampled stream) comprises a group of elements randomly sampled from a respective stream.
The process 100 estimates the cardinality of the single stream and/or of the set expressions using an MTS sketch created for each of the samples, where each of the MTS sketches includes a first data structure and a second data structure (subsample) computed for each of the samples. The process 100 computes an estimated sample cardinality for a single stream and/or for set expression(s) of the samples using the first data structure(s) created for the samples by estimating the cardinality of the elements appearing once in the sample(s). The estimated cardinality of the sample and/or set expression(s) of the samples may be computed using one or more cardinality estimation tools as known in the art, for example, the HyperLogLog algorithm and/or the like. A computed ratio value is then applied to the estimated sample cardinality; the ratio value estimates the ratio (proportion) between the cardinality of the elements appearing only once in the sample and the cardinality of the elements appearing only once in the full stream. The ratio value is computed using the second data structure(s) and applying one or more frequency estimation techniques as known in the art, for example, the Good-Turing technique.
Reference is also made to FIG. 2, which is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention. An exemplary system 200 for executing a process such as the process 100 to estimate a cardinality of set expressions between streams (sets) comprises a computing node 201 for example, a computer, a server, a cluster of computing nodes and/or any device having one or more processors. The computing node 201 may typically include an input/output (I/O) interface 202 for obtaining a plurality of samples 220 of the plurality of streams, a processor(s) 204 and a storage 206.
The I/O interface 202 may provide one or more interconnect interfaces, for example, a network interface, a local interface and/or the like. The network interface may support one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless LAN (WLAN) (e.g. Wi-Fi), a cellular network and/or the like. The local interface may include one or more interfaces, for example, a Universal Serial Bus (USB) interface, a memory management controller (MMC) interface, a serial interface and/or the like for connecting to one or more peripheral devices, for example a storage device and/or the like.
The processor(s) 204, homogenous or heterogeneous, may be arranged for parallel processing, as clusters and/or as one or more multi core processor(s).
The storage 206 may include one or more computer readable medium devices, either persistent storage and/or volatile memory for one or more purposes, for example, storing program code, storing data, storing intermediate computation products and/or the like. The persistent storage may include one or more persistent memory devices, for example, a Flash array, a Solid State Disk (SSD) and/or the like for storing program code. The volatile memory may also include one or more volatile memory devices, for example, a Random Access Memory (RAM) device. The storage 206 may further include one or more networked storage resources, for example, a storage server, a Network Attached Storage (NAS) and/or the like accessible through the I/O interface 202.
The processor(s) 204 may execute one or more software modules, for example, a process, an application, an agent, a utility, a script, a plug-in and/or the like, wherein a software module may comprise a plurality of program instructions stored in a non-transitory medium such as the storage 206 and executed by a processor such as the processor(s) 204. The processor(s) 204 may execute, for example, a cardinality estimator 210 for estimating the cardinality of the set expression, in particular a set union, a set intersection and a set difference between a plurality of streams each comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. In particular, the cardinality estimator 210 may estimate the cardinality of the set expression using only a significantly small sample 220 of each of the streams, obtained through the I/O interface 202 and/or from the storage 206. Optionally, the cardinality estimator 210 is executed by one or more virtual machines (VM) hosted by a computing node such as the computing node 201. Optionally, the cardinality estimator 210 is utilized as one or more remote services, for example, a remote server service, a cloud service, a Software as a Service (SaaS), a Platform as a Service (PaaS) and/or the like which are accessible over one or more networks from the computing node 201.
As shown at 102, the process 100 starts with the cardinality estimator 210 receiving a query for estimating a cardinality of a stream and/or of a set expression, in particular, a set union, a set intersection and a set difference between the plurality of streams each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
As shown at 104, the cardinality estimator 210 obtains the sample 220 of the stream in case of the single stream and/or the samples 220 of the plurality of streams in case of the set expressions, where each sample (sampled stream) 220 comprises a group of elements randomly sampled from the respective stream. The cardinality estimator 210 may obtain one or more of the samples 220 from one or more remote locations, for example, a server, a cloud service, a cloud storage and/or the like which are accessible from the computing node 201 over one or more networks through the I/O interface 202. The cardinality estimator 210 may also obtain one or more of the samples 220 from the storage 206, either from a local storage and/or from a remote storage resource accessible through the I/O interface 202. For example, the cardinality estimator 210 may obtain the sample(s) 220 from a local hard drive. In another example, the cardinality estimator 210 may obtain the sample(s) 220 from a NAS and/or the like. In another example, the cardinality estimator 210 may obtain the sample(s) 220 from an attachable storage drive and/or the like.
As shown at 106, the cardinality estimator 210 computes a first data structure and a second data structure for each of the samples 220. The computation of the first data structure(s) and the second data structure(s) is described in detail herein after.
As shown at 108, using the first data structure and the second data structure, the cardinality estimator 210 computes:
(1) An estimated sample cardinality value for the sample 220 of the stream and/or of one or more of the set expressions between the samples 220. Using the first data structure, the cardinality estimator 210 may apply one or more cardinality estimation tools as known in the art, for example, the HyperLogLog algorithm, to estimate the cardinality value of the sample 220 in case of the single stream. For the set expressions, the cardinality estimator 210 may extend the cardinality estimation techniques applied to the single stream to compute the estimated sample cardinality value of a set union of the samples 220, which may be regarded as a concatenation of the samples 220. The cardinality estimator 210 may apply conventions of set theory including, for example, the Jaccard similarity for further extending the cardinality estimation to other set expressions, for example, the set intersection and/or the set difference.
(2) A ratio value estimating the ratio (proportion) between the estimated sample cardinality value of the sample 220 (single stream) and/or of the set expression of the samples 220 (set expression between multiple streams) and the estimated cardinality of the entire (un-sampled) stream and/or the set expression between the entire streams respectively. In particular, the cardinality estimator 210 reduces the ratio value computation to estimation of the cardinality of elements appearing only once in the second data structure. The cardinality estimator 210 may apply one or more techniques as known in the art, for example, the Good-Turing frequency estimation technique, to compute the ratio between the estimated sample cardinality value and the estimated cardinality value of the entire stream(s).
The computation of the estimated sample cardinality value and the computation of the ratio value are described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams. To estimate the cardinality of the set expression between more than two streams, the cardinality estimator 210 may apply Algorithm 5, which extends Algorithms 2, 3 and/or 4 to a larger number of streams.
As shown at 110, the cardinality estimator 210 applies the computed ratio value to the estimated cardinality computed for the sample 220 (single stream) and/or for the set expression between the samples 220 (multiple streams), for example by multiplying the two values, to compute an estimated cardinality for the entire stream and/or for the set expression between the entire streams (multiple streams). The computation of the estimated cardinality for the set expression between the streams is described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams. To estimate the cardinality of the set expression between more than two streams, the cardinality estimator 210 may apply Algorithm 5, which extends Algorithms 2, 3 and/or 4 to a larger number of streams.
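By way of a non-limiting illustration, the step shown at 110 may be sketched in Python as follows. Two simplifications are assumptions of this sketch and are not taken from the description above: an exact distinct count stands in for a HyperLogLog style estimate of the sample cardinality, and the ratio value is taken to be 1/(1 - P0), where P0 is the fraction of sampled elements that appear exactly once in the sample. The function name is hypothetical.

```python
from collections import Counter


def estimate_stream_cardinality(sample) -> float:
    """Scale the sample cardinality by an assumed Good-Turing based ratio."""
    counts = Counter(sample)
    # Stand-in for a max-sketch (e.g. HyperLogLog) estimate of the sample.
    sample_cardinality = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    p0 = singletons / len(sample)          # estimated share of unseen elements
    if p0 >= 1.0:
        raise ValueError("every sampled element is a singleton; ratio undefined")
    ratio = 1.0 / (1.0 - p0)               # assumed form of the ratio value
    return ratio * sample_cardinality


sample = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "a"]   # hypothetical
estimate = estimate_stream_cardinality(sample)
```

In this hypothetical sample there are four distinct elements and two of the ten sampled elements are singletons, so the sketch scales the sample cardinality by 1.25.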
Preliminaries and Basis
Before describing one or more embodiments of the present invention some existing art techniques, methodologies and/or methods for estimating the cardinality are first described, in particular the Good-Turing frequency estimation technique and the Jaccard similarity statistic (also known as intersection over a set union and/or the Jaccard similarity coefficient).
The Good-Turing frequency estimation technique is useful in many language-related tasks where the problem is to determine the probability that a word appears in a document. Let $X = \{x_1, x_2, \ldots, x_s\}$ be a stream of elements possibly with repetitions, and $E = \{e_1, e_2, \ldots, e_n\}$ be the set of all different elements, such that $x_i \in E$. Suppose that we want to estimate the probability $p(e_j)$ that a randomly chosen element from the stream $X$ is $e_j$. A naive approach is to choose a sample $Y = \{y_1, y_2, \ldots, y_l\}$ of $l$ elements from the stream $X$, and then to set $p(e_j) = \#(e_j)/l$, where $\#(e_j)$ denotes the number of appearances of $e_j$ in the sample $Y$. However, this approach may be inaccurate, because for each element $e_j$ that does not appear even once in the sample $Y$ (i.e. an "unseen element"), $p(e_j) = 0$.
Let $E_i$ be the set of elements that appear exactly $i$ times in the sample $Y$. Good-Turing frequency estimation claims that $P_i = \frac{(i+1)\,|E_{i+1}|}{l}$ is a consistent estimator for the probability that an element of $X$ appears $i$ times in the sample $Y$. For the case where $i = 0$, the Good-Turing technique therefore suggests that $P_0 = |E_1|/l$. In other words, the hidden mass $P_0$ (i.e. the estimator for the hidden elements) may be estimated using the relative frequency of the elements that appear exactly once in the sample $Y$. For example, if 1/10 of the elements in the sample $Y$ appear only once in the sample $Y$, then approximately 1/10 of the elements in $X$ are unseen elements, namely, they do not appear at all in the sample $Y$.
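By way of a non-limiting illustration, the Good-Turing estimate of the hidden mass described above, i.e. the relative frequency of the elements that appear exactly once in the sample, may be sketched in Python as follows. The function name and the sample contents are hypothetical.

```python
from collections import Counter


def good_turing_hidden_mass(sample) -> float:
    """Fraction of the sample made up of elements that appear exactly once."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Ten sampled elements: "a" appears 5 times, "b" 3 times, "c" and "d" once.
sample = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "a"]
hidden_mass = good_turing_hidden_mass(sample)
```

Here two of the ten sampled elements are singletons, so the estimate suggests that about one fifth of the stream consists of elements not seen in the sample.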
Jaccard similarity, as known in the art, is defined as where
Figure imgf000020_0001
and are two finite streams (sets). The Jaccard similarity value ranges between 0, when the two streams and are completely different, and 1 , when the two streams and are identical. An efficient and accurate estimate of is known in the art and may be computed as follows. First, each element in the streams and is hashed into (0, 1). Then, the maximal value of each stream is taken as a sketch that represents the whole stream. As demonstrated in the art, the probability that the sketches of the streams and are equal is exactly . When only one hash function is used, the variance of the estimate of may be infinite. Thus, 0 hash functions may be used, and the sketch representing each of the streams is actually a vector of 0 maximal values. As demonstrated in the art, improved performance may be attained if instead of 0 hash functions only two hash functions with stochastic averaging are used.
This may be stated formally as follows. Given a stream $A = \{a_1, a_2, \ldots\}$ and $m$ different hash functions $h_1, \ldots, h_m$, the maximal hash value for the $j$-th hash function is $h_j^{\max}(A) = \max_i h_j(a_i)$. The sketch of the stream $A$ may therefore be expressed as $\{h_1^{\max}(A), \ldots, h_m^{\max}(A)\}$, and the sketch of the stream $B$ may be expressed similarly as $\{h_1^{\max}(B), \ldots, h_m^{\max}(B)\}$. The two sketches can then be used as expressed in Equation 1 below to estimate the Jaccard similarity of the streams $A$ and $B$.
Equation 1:
$$\hat{\rho}(A,B) = \frac{1}{m} \sum_{j=1}^{m} I_{h_j^{\max}(A) = h_j^{\max}(B)}$$
where the indicator variable $I_{h_j^{\max}(A) = h_j^{\max}(B)}$ is 1 if $h_j^{\max}(A) = h_j^{\max}(B)$ and 0 otherwise.
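The maximal-hash estimate of the Jaccard similarity may be sketched in Python as follows. This is an illustrative sketch only: the seeded SHA-256 construction standing in for a family of independent hash functions, the function names, and the choice of 128 hash functions are assumptions that do not come from the specification.

```python
import hashlib


def hash01(element, seed):
    """Map (seed, element) to a pseudo-random float in [0, 1).

    Assumption: a seeded SHA-256 digest stands in for the j-th hash function.
    """
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def max_hash_sketch(stream, num_hashes):
    """Return the vector of maximal hash values, one per hash function."""
    items = list(stream)
    if not items:
        raise ValueError("stream must be non-empty")
    return [max(hash01(x, j) for x in items) for j in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of hash functions whose maximal values coincide."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


# Two streams with true Jaccard similarity 400 / 800 = 0.5.
stream_a = range(0, 600)
stream_b = range(200, 800)
rho_hat = estimate_jaccard(max_hash_sketch(stream_a, 128),
                           max_hash_sketch(stream_b, 128))
print(f"estimated Jaccard similarity: {rho_hat:.3f}")
```

With 128 hash functions the estimate is a mean of 128 Bernoulli indicators, so its standard deviation for a true similarity of 0.5 is about 0.044.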
As known in the art, the Jaccard similarity may be generalized to set difference as expressed in Equation 2 below.
Equation 2:
Figure imgf000021_0001
Thus, the estimator presented in Equation 1 may be generalized as expressed in Equation 3 below.
Equation 3:
Figure imgf000021_0002
where the indicator variable is 1 if and 0 otherwise. A similar
Figure imgf000021_0004
estimation may be performed for a set difference such as
Figure imgf000021_0005
. In order to simplify the notations, the notations , and are used herein after to indicate the Jaccard similarity variables ,
Figure imgf000021_0006
and respectively.
Figure imgf000021_0007
MTS Based Cardinality Estimation for a Set-Expression
According to some embodiments of the present invention, the MTS methodology may be used to accurately estimate the cardinality of set expressions of a plurality of streams using only a small sample of each of the streams. The set expressions, for example, a set union, a set intersection, a set difference and/or the like, are created by applying one or more combination functions, for example, a union, an intersection and a difference respectively, to the plurality of streams.
The MTS methodology and the algorithms utilizing the MTS sketch are first presented for estimating the cardinality of a set expression of two streams and are extended hereinafter to a set expression of the plurality of streams.
Table 1 below presents some notations used hereinafter.
Table 1 :
Figure imgf000021_0008
Figure imgf000022_0014
Estimating the cardinality for a single stream using a generic scheme that combines a sampling process with a cardinality estimation procedure of a single stream as known in the art may consist of two steps: (a) using one or more cardinality estimators as known in the art for estimating cardinality of a sampled stream comprising samples of the original stream; and (b) estimating a sampling ratio, namely, the factor by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full original stream. Such estimation is typically based on storing a small fixed-size subsample of the sampled stream and using it to estimate the probability of unseen elements using the Good-Turing technique.
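The two-step scheme may be sketched as follows. The sketch rests on a loudly stated assumption: the ratio factor is taken to be 1 / (1 - P0), where P0 is the Good-Turing estimate of the probability of unseen elements; the specification gives the exact relation only in an equation image that is not reproduced in the text. For clarity, exact counting replaces both the cardinality estimator and the fixed-size subsample, and all names are hypothetical.

```python
from collections import Counter


def estimate_stream_cardinality(sample, cardinality_estimator=None):
    """Two-step estimate of the number of distinct elements in the full stream.

    Step (a): estimate the cardinality of the sample.
    Step (b): scale it by the ratio factor, ASSUMED here to be 1 / (1 - P0),
              with P0 the Good-Turing hidden-mass estimate.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)

    # Step (a): any single-stream estimator may be plugged in; exact
    # counting is the default stand-in.
    if cardinality_estimator is None:
        n_sample = len(counts)
    else:
        n_sample = cardinality_estimator(sample)

    # Step (b): Good-Turing hidden mass from exact singleton counting.
    p0 = sum(1 for c in counts.values() if c == 1) / len(sample)
    if p0 >= 1.0:
        raise ValueError("every sampled element is a singleton; ratio undefined")
    return n_sample / (1.0 - p0)


# 4 distinct elements, 2 singletons out of 6 samples: 4 / (1 - 1/3) = 6.
print(estimate_stream_cardinality(["a", "a", "b", "c", "c", "d"]))
```

Exact counting needs storage linear in the sample size, which is the cost that the fixed-size subsample is intended to avoid.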
According to some embodiments of the present invention, the scheme used for estimating the cardinality of the single stream may be generalized to set expressions between multiple streams. The cardinality estimation is based on maintaining an MTS sketch for each of the plurality of streams. The MTS sketch comprises a small fixed-size subsample of the sampled stream (i.e. the sample), and this subsample is used for estimating the probability of unseen elements. To this end, the MTS sketch stores two data structures for each sampled stream (sample): a first data structure, which includes the maximal hash value for each hash function, and a second data structure, which comprises a small fixed-size uniform subsample of the sample.
Reference is now made to FIG. 3, which is a schematic illustration of a sampled stream space. Illustration 300 presents a stream comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. A sample is a sampled stream of the stream, comprising elements randomly sampled from the stream. For a given sampling rate, the sample includes the corresponding fraction of the elements of the stream. A subsample includes part of the sample; in particular, the subsample may include a fixed number of elements of the sample which have the minimal hash values among the elements of the sample. The subsample may be generated using, for example, one-pass reservoir sampling as known in the art. Using the one-pass reservoir sampling implementation, the subsample is first initialized with the first elements of the sample, up to the subsample length, and the elements are then sorted in decreasing order of their hash values. When a new element is sampled into the sample, the hash value of the newly sampled element is compared to the current maximal hash value of the elements in the subsample. In case the hash value of the new element is smaller than the current maximal hash value of the elements in the subsample, the new element is stored in the subsample instead of the element having the maximal hash value. Otherwise, the new element is ignored. After all elements of the sample are processed, the subsample stores the elements whose hash values were minimal, and it can be considered as a uniform subsample of the fixed length.
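The one-pass maintenance of the subsample may be sketched in Python as follows. A max-heap replaces the sorted list described above, which is a design substitution and not the specification's stated data structure. The hash function, the skipping of repeated elements and all names are likewise assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(element):
    """Map an element to a pseudo-random float in [0, 1) (assumed hash)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_k_subsample(sample, k):
    """Keep the k distinct elements of the sample with the minimal hash values.

    Returns a list of (hash_value, element) pairs in increasing hash order.
    """
    heap = []          # max-heap emulated with negated hash values
    members = set()    # elements currently stored in the subsample
    for element in sample:
        if element in members:
            continue   # assumption: repeated elements are stored once
        h = hash01(element)
        if len(heap) < k:
            heapq.heappush(heap, (-h, element))
            members.add(element)
        elif h < -heap[0][0]:
            # The new hash is smaller than the current maximum: replace it.
            _, evicted = heapq.heapreplace(heap, (-h, element))
            members.discard(evicted)
            members.add(element)
        # Otherwise the new element is ignored.
    return sorted((-neg_h, element) for neg_h, element in heap)


toy_sample = [f"e{i % 50}" for i in range(500)]
for hash_value, element in bottom_k_subsample(toy_sample, 4):
    print(f"{element}: {hash_value:.4f}")
```

An element that was evicted earlier cannot re-enter the subsample, because the current maximum only decreases over time and the evicted element's hash value was at least as large as the maximum at eviction time.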
It should be noted that the MTS sketch is additive, i.e., the MTS sketch of a set union of a plurality of streams may be computed directly from the MTS sketches of the individual streams. Corollary 1 below summarizes this additivity property for two streams, which may be generalized for any number of streams.
Corollary 1 :
Assuming and are two streams (sets) with samples designated and
Figure imgf000023_0003
Figure imgf000023_0004
Figure imgf000023_0005
respectively, the MTS sketches of the streams and are:
Figure imgf000023_0001
Then, the MTS sketch of may be expressed as:
Figure imgf000023_0002
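The additivity property may be illustrated with the following Python sketch. The exact expressions of Corollary 1 appear only in equation images, so the merge rules below are the standard ones for sketches of this kind and are assumptions: the element-wise maximum for the vector of maximal hash values, and the elements with the minimal hash values for the subsample. Any additional per-element bookkeeping the subsample may carry is not modelled, and all names are hypothetical.

```python
def merge_max_hashes(max_a, max_b):
    """Maximal hash values of the union: element-wise maximum."""
    if len(max_a) != len(max_b):
        raise ValueError("sketches must use the same number of hash functions")
    return [max(a, b) for a, b in zip(max_a, max_b)]


def merge_subsamples(sub_a, sub_b, k):
    """Subsample of the union: the k entries with the minimal hash values.

    Each subsample is a dict mapping an element to its hash value; an
    element present in both inputs has the same hash value in each.
    """
    merged = {**sub_a, **sub_b}
    smallest = sorted(merged.items(), key=lambda item: item[1])[:k]
    return dict(smallest)


print(merge_max_hashes([0.9, 0.4], [0.7, 0.8]))
print(merge_subsamples({"x": 0.1, "y": 0.3}, {"y": 0.3, "z": 0.2}, k=2))
```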
Cardinality Estimation for a Single Stream
The MTS methodology is first described for a single stream. As described herein above, estimating the cardinality for the single stream may be done by applying the Good-Turing technique to combine the sampling process with the generic cardinality estimation procedure of the single stream. The Good-Turing algorithm may receive the sampled stream (i.e. the sample) as an input and return an estimate for the cardinality of the full stream. The Good-Turing algorithm consists of two steps: (a) estimating the cardinality of the sample using any procedure for estimating the cardinality of a single stream without sampling as known in the art, the procedure being designated CAR EST PROC hereinafter; and (b) estimating the ratio factor
Figure imgf000024_0002
by which the cardinality of the sample
Figure imgf000024_0022
(sampled stream) should be multiplied in order to estimate the cardinality of the full original stream
Figure imgf000024_0023
Figure imgf000024_0003
To estimate
Figure imgf000024_0004
in step (a), CAR EST PROC is invoked using
Figure imgf000024_0021
storage units. To estimate the ratio factor in step (b), it is noted that the probability
Figure imgf000024_0005
for unseen elements in the stream may be expressed as
Figure imgf000024_0006
Therefore, the problem of estimating
Figure imgf000024_0007
may be reduced to estimating the probability of
Figure imgf000024_0009
unseen elements. According to the Good-Turing technique, is a consistent
Figure imgf000024_0008
estimator for
Figure imgf000024_0010
as described herein above. Thus, identifying the number of
Figure imgf000024_0011
elements that appear exactly once in the sampled stream may be sufficient for estimating the cardinality of the stream
Figure imgf000024_0013
To compute the value precisely as known in the
Figure imgf000024_0012
art, all the elements in the sample
Figure imgf000024_0014
may need to be tracked while ignoring each previously encountered element. To this end,
Figure imgf000024_0015
storage units may be needed, which is linear in the sample size and is therefore not scalable.
However, the number of storage elements as well as the processing resources may be significantly reduced, thus reducing cost, complexity, time and/or the like, by reducing the estimation problem to computing an approximation of the value of
Figure imgf000024_0016
using the subsample of the sample according to some embodiments of the present
Figure imgf000024_0017
Figure imgf000024_0018
invention. This algorithm for estimating the cardinality of a single stream
Figure imgf000024_0019
using the MTS sketch may be formulated by algorithm 1 below which utilizes procedure 1 below for estimating
Figure imgf000024_0020
Algorithm 1:
Figure imgf000024_0001
Figure imgf000025_0001
Procedure 1 :
Figure imgf000025_0002
Cardinality Estimation for a Set Union between Two Streams
According to some embodiments of the present invention, Algorithm 1 may be extended for estimating the cardinality of a set union of two streams. Assume the samples (sampled streams) of the two streams are
Figure imgf000025_0003
respectively. Let
Figure imgf000025_0004
be the concatenation of the samples
Figure imgf000025_0005
The concatenation is actually a sample of the union of the two streams, i.e.
Figure imgf000025_0007
Figure imgf000025_0006
(refer to Table 1 for the notation). Thus, estimating the cardinality of the union is equivalent to estimating the cardinality of a single stream using the concatenation
Figure imgf000025_0008
Estimating the cardinality of the union using the samples may be done using
Figure imgf000025_0009
Algorithm 2 below, which in turn may use Algorithm 1 for processing the MTS sketch of the concatenation.
Algorithm 2:
Figure imgf000025_0010
Cardinality Estimation of a Set Intersection between Two Streams
According to some embodiments of the present invention, Algorithm 1 and Algorithm 2 may be extended for estimating the cardinality of a set intersection of the two streams $A$ and $B$. As known in the art, $|A \cap B| = \rho(A,B) \cdot |A \cup B|$, where $\rho(A,B)$ is the Jaccard similarity of the two full streams $A$ and $B$. Algorithm 2 may therefore be used for estimating $|A \cup B|$, while the Jaccard similarity $\rho(A,B)$ of the streams $A$ and $B$ needs to be estimated. As known in the art, the Jaccard similarity may be expressed as shown in Equation 4 below.
Equation 4:
Figure imgf000026_0001
Equation 5 below may be formulated according to Good-Turing (refer to Table 1 for the notations).
Equation 5:
Figure imgf000026_0002
Similar equations may be formulated to express Substituting
Figure imgf000026_0003
Equation 5 into Equation 4 may produce Equation 6 below.
Equation 6:
Figure imgf000026_0004
or equivalently
Figure imgf000026_0005
Denoting (refer to Table 1 for the notations), Equation 6
Figure imgf000026_0006
may be rewritten as expressed in Equation 7 below.
Equation 7:
Figure imgf000026_0007
Algorithm 3 below may be used for estimating the cardinality a a of the set intersection of using the samples
Figure imgf000026_0012
Figure imgf000026_0010
In algorithm 3,
Figure imgf000026_0008
may be estimated using Procedure 1. Additionally,
Figure imgf000026_0011
may also be estimated using Procedure 1 using the
Figure imgf000026_0009
. Finally, may be estimated from and
Figure imgf000026_0016
Figure imgf000026_0013
using Procedure 3 below.
Figure imgf000026_0014
Algorithm 3:
Figure imgf000026_0015
Figure imgf000027_0001
Procedure 3 :
Figure imgf000027_0002
Cardinality Estimation of a Set Difference between Two Streams
According to some embodiments of the present invention, Algorithm 1 and Algorithm 2 may be similarly extended for estimating the cardinality of a set difference of the two streams $A$ and $B$. As known in the art, $|A \setminus B| = \rho_{A \setminus B} \cdot |A \cup B|$, where $\rho_{A \setminus B} = |A \setminus B| / |A \cup B|$ according to Equation 2. Thus, Algorithm 3 may be used for estimating the cardinality $|A \setminus B|$ of the set difference using the samples of $A$ and $B$, with the only difference being that the Jaccard similarity variable of the set difference is estimated rather than that of the set intersection.
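The relations between the union cardinality, the Jaccard similarity variables and the cardinalities of the set intersection and set difference can be checked on small exact sets, as in the following Python sketch. Exact set operations stand in for the sketch-based estimates, and the function names and dictionary keys are hypothetical.

```python
def jaccard_variables(a, b):
    """Exact Jaccard similarity variables of two sets.

    Each variable is the size of the set expression divided by the
    size of the union.
    """
    a, b = set(a), set(b)
    union_size = len(a | b)
    if union_size == 0:
        raise ValueError("at least one set must be non-empty")
    return {
        "intersection": len(a & b) / union_size,
        "a_minus_b": len(a - b) / union_size,
        "b_minus_a": len(b - a) / union_size,
    }


def set_expression_cardinalities(union_cardinality, rho):
    """Multiply each Jaccard variable by the (estimated) union cardinality."""
    return {name: value * union_cardinality for name, value in rho.items()}


set_a = set(range(0, 60))
set_b = set(range(40, 100))
rho = jaccard_variables(set_a, set_b)
print(set_expression_cardinalities(len(set_a | set_b), rho))
```

In the sampled setting, the union cardinality would come from the union estimation and the Jaccard variables from the sketches, so the estimation errors of both factors combine in the product.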
Applying the inclusion-exclusion principle and some algebraic manipulations, the variable may be formulated as expressed in Equation 8 below.
Equation 8:
Figure imgf000027_0003
By substituting Equation 5 into Equation 8, Equation 9 may follow.
Equation 9:
Figure imgf000028_0001
Using the notations of Table 1 , where Equation 9 may be rewritten as
Figure imgf000028_0002
expressed in Equation 10 below.
Equation 10:
Figure imgf000028_0003
Algorithm 4 below which is an adjustment of Algorithm 3 may be used for estimating the cardinality a a of the set difference of using the samples
Figure imgf000028_0005
and In algorithm 4 may be estimated using Procedure 1. In addition,
Figure imgf000028_0004
may be estimated using Procedure 3.
Figure imgf000028_0006
Algorithm 4:
Figure imgf000028_0007
MTS Based Cardinality Estimation for a Set Expression between Multiple Streams
According to some embodiments of the present invention the MTS methodology, in particular Algorithm 1, Algorithm 2, Algorithm 3 and/or Algorithm 4 may be extended to estimate the cardinality of set expressions between
Figure imgf000028_0011
streams, where
Figure imgf000028_0012
. Assuming are
Figure imgf000028_0008
streams, and
Figure imgf000028_0009
are the respective samples, i.e. their respective sampled streams. The samples may be used to estimate the
Figure imgf000028_0010
cardinality of
Figure imgf000028_0013
. As presented herein above for the case of the two streams and , the sample 0 may be expressed as
021 . The cardinalities of the stream 0 and the sample
Figure imgf000028_0014
Figure imgf000028_0016
may be denoted by
Figure imgf000028_0017
respectively. Denoting as a "generalized" Jaccard similarity the "generalized" Jaccard similarity may be estimated from
Figure imgf000028_0015
in a similar way to the estimation of in Equation 1 as shown
Figure imgf000028_0018
in Equation 11 below.
Equation 11:
Figure imgf000029_0001
Where the indicator variable
Figure imgf000029_0002
is 1 if, for the
Figure imgf000029_0003
hash function, satisfy the condition implied by the set expressions, and is 0
Figure imgf000029_0004
otherwise.
Using algebraic manipulations and the definition of the following expression may be obtained:
Figure imgf000029_0005
Thus, may be estimated using the following Equation:
Figure imgf000029_0006
Algorithm 5 below may be used for estimating the cardinality of the set
Figure imgf000029_0007
expression
Figure imgf000029_0008
between the
Figure imgf000029_0009
streams with sampling using the MTS sketch methodology. Algorithm 5 consists of three steps: (a) using Equation 11 to estimate ; (b) using CAR EST PROC to estimate
Figure imgf000029_0010
and (c) using Procedure 5 to estimate the factor (ratio) by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full stream.
Algorithm 5 may use Procedure 5 below for estimating
Figure imgf000029_0011
Algorithm 5:
Figure imgf000029_0012
Figure imgf000030_0001
Analytical Analysis
The correctness of the MTS methodology, in particular, the correctness of Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4 and Algorithm 5, may be verified through an analytical analysis. In order to simplify the notations, a single notation is used to denote the estimated cardinality in each of the Algorithms.
Lemma 1 is presented to describe how to compute probability distribution of a product of two normally distributed random variables whose covariance is 0.
Lemma 1 (Product distribution):
Assuming
Figure imgf000030_0002
are two random variables satisfying the condition
Figure imgf000030_0005
, and then as known in the art, the product asymptotically satisfies
Figure imgf000030_0003
Figure imgf000030_0006
the following condition:
Figure imgf000030_0004
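The behaviour of a product of two uncorrelated, approximately normal variables may be checked numerically with the following Python sketch. The closed-form expression used here is the standard first-order (delta-method) approximation, with mean equal to the product of the means and variance equal to mu_y^2 * var_x + mu_x^2 * var_y. It is an assumption that this is the form stated by Lemma 1, since the lemma's expression appears only in an equation image. All names and numeric values are illustrative.

```python
import random


def product_moments_approx(mu_x, var_x, mu_y, var_y):
    """First-order approximation of the mean and variance of X * Y
    for uncorrelated X and Y (assumed form of the lemma)."""
    return mu_x * mu_y, mu_y**2 * var_x + mu_x**2 * var_y


def simulate_product(mu_x, var_x, mu_y, var_y, trials=100_000, seed=0):
    """Monte Carlo mean and variance of the product of independent normals."""
    rng = random.Random(seed)
    products = [
        rng.gauss(mu_x, var_x**0.5) * rng.gauss(mu_y, var_y**0.5)
        for _ in range(trials)
    ]
    mean = sum(products) / trials
    variance = sum((p - mean) ** 2 for p in products) / trials
    return mean, variance


approx = product_moments_approx(10.0, 0.01, 5.0, 0.0025)
simulated = simulate_product(10.0, 0.01, 5.0, 0.0025)
print("approximation:", approx)
print("simulation:   ", simulated)
```

The approximation omits the second-order term var_x * var_y, which is negligible when the standard deviations are small relative to the means.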
For the analysis, the HyperLogLog algorithm as known in the art is used for the CAR EST PROC procedure in the MTS based Algorithms described herein above. The HyperLogLog estimator belongs to a family of sketches and may present
Figure imgf000030_0007
improved cardinality estimation compared to other estimators known in the art. The standard error of the HyperLogLog estimator is approximately $1.04/\sqrt{m}$, where $m$ represents the number of storage units (e.g. registers) used for the estimation procedure. Pseudo-code of the HyperLogLog procedure is presented in Algorithm 6 below.
Algorithm 6:
Figure imgf000031_0001
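For orientation, a compact Python sketch of the commonly published HyperLogLog procedure is given below. It is not a transcription of Algorithm 6, whose pseudo-code appears only as an image; the SHA-256 based 64-bit hash, the register layout and the small-range correction follow the textbook variant and are assumptions, as are the function names.

```python
import hashlib
import math


def hll_standard_error(m):
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(m)


def hyperloglog_estimate(stream, b=10):
    """Estimate the number of distinct elements using m = 2**b registers."""
    if not 7 <= b <= 16:
        raise ValueError("b must be in [7, 16] for the alpha formula used here")
    m = 1 << b
    registers = [0] * m
    for element in stream:
        digest = hashlib.sha256(str(element).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = x >> (64 - b)                          # first b bits pick a register
        rest = x & ((1 << (64 - b)) - 1)               # remaining 64 - b bits
        rank = (64 - b) - rest.bit_length() + 1        # leading zeros + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)                   # bias constant, m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


stream = [i % 20_000 for i in range(40_000)]           # 20,000 distinct values
print(f"estimate: {hyperloglog_estimate(stream, b=10):.0f}")
print(f"expected relative error: {hll_standard_error(1 << 10):.4f}")
```

Repeated elements do not change the registers, so the estimate depends only on the set of distinct elements in the stream.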
Lemma 2 below summarizes the statistical performance of Algorithm 6 without sampling, i.e., when the algorithm processes the entire stream.
Lemma 2:
Figure imgf000031_0003
For Algorithm 6, as known in the art where is the actual cardinality of
Figure imgf000031_0002
Figure imgf000031_0004
the considered set, 13 is the estimated cardinality computed using Algorithm 6, and
Figure imgf000031_0005
is the number of storage units used by Algorithm 6.
Corollary 2:
Let and be two streams. When Algorithm 6 is used with 13 storage units and without sampling, the following applies:
Figure imgf000031_0006
As presented in the art, the asymptotic bias and variance of Algorithm 1 were analyzed when using the HyperLogLog algorithm as the CAR EST PROC. It was demonstrated that the sampling rate does not affect the asymptotic unbiasedness of the estimator. The effect of the sampling rate on the estimator's variance was further analyzed with respect to the two storage sizes used by the algorithm. The following theorem summarizes the statistical performance of Algorithm 1.
Theorem 1:
As proved in the art, Algorithm 1 estimates
Figure imgf000032_0003
with mean value
Figure imgf000032_0005
and variance namely, where
Figure imgf000032_0001
Figure imgf000032_0002
In addition, as shown in the art, and satisfy the following
Figure imgf000032_0004
conditions:
Figure imgf000032_0006
where are the distinct elements in the original (un-sampled) stream 13,
Figure imgf000032_0007
and is the frequency of element 0 in stream 0.
As described herein above, estimating the set union cardinality using Algorithm 2 is equivalent to estimating the cardinality of a single stream based on its sampled stream. Thus, the statistical performance of Algorithm 2 is equal to that of Algorithm 1.
Corollary 3:
Algorithm 2 estimates
Figure imgf000032_0008
a with mean value
Figure imgf000032_0009
and variance
Figure imgf000032_0010
namely,
Figure imgf000032_0011
where 0 and 0 are as stated in Theorem 1 with respect to the union stream
The following Lemmas, i.e. Lemma 3, Lemma 4 and Lemma 5 are used herein after for the analysis of the performance of Algorithm 3 and Algorithm 4 using the MTS sketch.
Lemma 3:
As proved in the art where
Figure imgf000032_0012
Figure imgf000032_0014
is the length of the subsample
Figure imgf000032_0013
Lemma 4:
Procedure 3 estimates with mean value and variance namely,
Figure imgf000032_0015
Figure imgf000032_0016
where s the cardinality of
Figure imgf000032_0018
Figure imgf000032_0017
Lemma 4 may be proved as follows:
Procedure 3 estimates
Figure imgf000033_0001
Denote the distinct elements in the union subsample as For each the probability that
Figure imgf000033_0002
Figure imgf000033_0003
belongs to may be expressed as follows:
Figure imgf000033_0004
Figure imgf000033_0005
It follows that is a sum of Bernoulli variables with success probability ¾.
Figure imgf000033_0006
Therefore, it is binomially distributed, and can be asymptotically approximated using normal distribution as
Figure imgf000033_0007
Figure imgf000033_0008
Lemma 5:
The covariance (defined similarly)
Figure imgf000033_0009
satisfies the
Figure imgf000033_0010
cardinality of
Figure imgf000033_0011
Lemma 5 may be proved as follows:
Recall that
Figure imgf000033_0012
and similarly for The dependence is between
Figure imgf000033_0013
thus Equation 12 below follows from covariance properties.
Figure imgf000033_0014
Equation 12:
Figure imgf000033_0015
The distinct elements in the union subsample may be denoted
Figure imgf000033_0018
Figure imgf000033_0016
As shown in Procedure 3, may be written as expressed in Equation 13 below.
Figure imgf000033_0017
Equation 13:
Figure imgf000033_0019
where is an indicator variable that gets 1 and 0 otherwise.
Figure imgf000033_0020
Similarly may be rewritten using indicator variables that gets 1 if and 0
Figure imgf000034_0002
Figure imgf000034_0003
Figure imgf000034_0004
otherwise.
Using covariance properties and Equation 13 the covariance may be expressed as shown in Equation 14 below.
Equation 14:
Figure imgf000034_0001
The first and third equalities are due to covariance properties. The second equality is due to the independence
Figure imgf000034_0005
The fourth equality is due to Lemma 4. It should be noted that follows in the same way as the proof of Lemma
Figure imgf000034_0006
4. The last equality is obtained through algebraic manipulations. The resulting expression follows by substituting Equation 14 into Equation 12.
Theorem 2 below states the asymptotic statistical performance of Algorithm 3 used for estimating the cardinality of the set intersection as described herein above.
Theorem 2:
Algorithm 3 estimates
Figure imgf000034_0007
a with mean value and variance namely,
Figure imgf000034_0008
Figure imgf000034_0009
where satisfies the following condition:
Figure imgf000034_0010
Figure imgf000034_0011
Figure imgf000034_0012
Where
Figure imgf000034_0013
Theorem 2 may be proved as follows:
Figure imgf000035_0002
may be denoted
Figure imgf000035_0003
Similarly
Figure imgf000035_0004
may be denoted with the respective expression. Thus, the estimator in Algorithm 3 as expressed in Equation 7 may be rewritten as follows:
Figure imgf000035_0005
The asymptotic distribution of may be first analyzed. Recall that according to
Figure imgf000035_0006
Good-Turing Equation 15 below follows.
Equation 15:
Figure imgf000035_0007
Applying Lemma 1 on
Figure imgf000035_0008
the expectation may be expressed as:
Figure imgf000035_0009
The second equality follows by substituting and using Equation 15.
Figure imgf000035_0010
The variance may be expressed as
Figure imgf000035_0001
The first equality is due to the definition of The limit is because and
Figure imgf000035_0011
The last equality follows Lemma 3 and
Figure imgf000035_0012
Lemma 4. This may result in Equation 16 below.
Equation 16:
Figure imgf000035_0013
The asymptotic distribution of ¾ may be analyzed similarly. The estimator
Figure imgf000036_0001
in Algorithm 3 is now analyzed. Note that
Figure imgf000036_0002
are dependent variables.
In Lemma 5 we proved that The
Figure imgf000036_0003
expectation may therefore be expressed as:
Figure imgf000036_0004
It follows that is an unbiased estimator for . The variance may be expressed by Equation 17 below.
Equation 17:
Figure imgf000036_0005
Where and similarly for . The first equality is due to
Figure imgf000036_0006
variance properties and the second equality follows from Equation 16 and Lemma 5.
In total the estimator
Figure imgf000036_0007
is obtained, where is as stated in Equation
Figure imgf000036_0008
17. Applying Lemma 1 on the independent variables and a concludes the
Figure imgf000036_0009
proof.
Theorem 3 below states the asymptotic statistical performance of Algorithm 4 used for estimating the cardinality of the set difference as described herein above.
Theorem 3:
Algorithm 4 estimates
Figure imgf000036_0011
with mean value and variance namely,
Figure imgf000036_0012
Figure imgf000036_0013
, where
Figure imgf000036_0010
Figure imgf000037_0002
where s as stated in Theorem 2.
Lemma 6 below is used for the analysis of Algorithm 5.
Lemma 6:
In Equation 1 the estimation of is normally distributed with mean and variance
Figure imgf000037_0001
The same may apply for the estimation of and according to Equation 3, with the change of to and respectively.
Lemma 6 may be proved for where the proof for and is similar. As known in the art, for the hash function the following applies:
P
Figure imgf000037_0003
The intuition is considering the hash function
Figure imgf000037_0005
and defining
Figure imgf000037_0006
for every sample 0, as the element in the sample 13 whose hash value for
Figure imgf000037_0007
is maximum
Figure imgf000037_0008
Therefore
Figure imgf000037_0004
applies only when
Figure imgf000037_0009
lies in
. The probability for this condition is the Jaccard ratio , and therefore
Figure imgf000037_0010
From Equation and Equation 18 follows that is a sum of
Figure imgf000037_0011
Bernoulli variables. Therefore, it is binomially distributed, and can be asymptotically approximated by a normal distribution, namely,
Figure imgf000037_0012
Figure imgf000037_0013
Algorithm 5, used for estimating the cardinality of set expressions, in particular a set union, a set intersection and a set difference between multiple streams as described herein above, is now analyzed. Theorem 4 below states the asymptotic statistical performance of Algorithm 5.
Theorem 4:
Algorithm 5 estimates
Figure imgf000038_0001
with mean value
Figure imgf000038_0018
and variance— namely
Figure imgf000038_0002
Figure imgf000038_0004
where is the original full and un-sampled
Figure imgf000038_0003
stream is the length of the subsample stream 0 , as described in Procedure 5.
Figure imgf000038_0005
Theorem 4 may be proved as follows:
The estimator for the set expression between the 13 streams
Figure imgf000038_0006
recall that
Figure imgf000038_0007
According to Lemma 6, the term may be expressed as
According to Corollary 2, the term a may be expressed
Figure imgf000038_0010
Figure imgf000038_0009
Figure imgf000038_0008
Considering the product
Figure imgf000038_0011
then according to Lemma 1 and because the variables are independent, the following may be obtained:
Figure imgf000038_0012
Thus
Figure imgf000038_0013
Denoting
Figure imgf000038_0014
athe estimator is the final term in the estimator, i.e. is
Figure imgf000038_0016
Figure imgf000038_0015
now analyzed. According to Lemma Therefore
Figure imgf000038_0017
according to Lemma 1 and because the variables are independent, the following may be obtained:
Figure imgf000039_0001
The last equality is due to which follows from the teaching of Good-
Figure imgf000039_0002
Turing.
Simulation Tests
Simulation tests were conducted to validate the MTS methodology, in particular the asymptotic bias and variance stated by the Theorems presented herein above for the MTS Algorithms 1, 2, 3, 4 and 5. More specifically, the simulation was conducted to demonstrate the following:
Algorithms 3 and 4 are unbiased, as proven by Theorems 2 and 3.
The variance of Algorithms 3 and 4 is close to their analyzed variance in Theorems 2 and 3.
The variance of Algorithm 5 is close to its analyzed variance in Theorem 4.
The simulation tests were conducted with the MTS Algorithms implementing the HyperLogLog as the CAR EST PROC procedure for estimating the cardinality.
The simulation tests for Algorithms 3 and 4 were conducted over two streams (sets), and , whose cardinalities are as follows:
Figure imgf000039_0003
Each distinct element appears times in the original un-sampled streams and . The frequencies are determined according to the following models known in the art:
Uniform distribution: The frequency $f$ of the elements is uniformly distributed between 100 and 1,000; i.e., $f \sim U(100, 1000)$.
Pareto distribution: The frequency $f$ of the elements follows the heavy-tailed rule with shape parameter $\alpha$ and scale parameter $s$; i.e., the frequency probability function is $p(f) = \alpha s^{\alpha} f^{-\alpha-1}$ for $f \ge s$. The scale parameter $s$ represents the smallest possible frequency. The Pareto distribution has several unique properties. In particular, if $\alpha \le 2$ the Pareto distribution has infinite variance, and if $\alpha \le 1$ the Pareto distribution has infinite mean. As $\alpha$ decreases, a larger portion of the probability mass is in the tail of the distribution, and the Pareto distribution is therefore useful when a small percentage of the population controls the majority of the measured quantity.
Each of the simulation tests was repeated for 1,000 different streams (sets). Thus, for each of the simulated MTS Algorithms and for each tested parameter value, a vector of 1,000 different estimations was produced. Then, for each such value, the variance and bias of this vector were computed, and the results as presented hereinafter are considered as the variance and bias of the respective Algorithm for that specific value. Each such computation is represented by one table row in Table 2, Table 3 and Table 4 below. The vector of estimations for a specific Algorithm and a specific parameter value, and the mean of that vector, may be expressed as
Figure imgf000040_0001
Figure imgf000040_0013
The bias and variance of 0 are computed as follows:
Figure imgf000040_0002
Figure imgf000040_0012
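The frequency models and the bias and variance computation of the simulation may be sketched in Python as follows. The normalization used here, bias as the mean estimate divided by the true value minus one and variance divided by the square of the true value, is an assumption, since the exact expressions appear only in equation images. The Pareto parameter values and all function names are likewise hypothetical.

```python
import random


def uniform_frequency(rng, low=100, high=1000):
    """Element frequency drawn uniformly between low and high (inclusive)."""
    return rng.randint(low, high)


def pareto_frequency(rng, shape, scale):
    """Element frequency drawn from a Pareto distribution by inverse-CDF
    sampling; the scale parameter is the smallest possible value."""
    u = 1.0 - rng.random()              # u lies in (0, 1]
    return scale / u ** (1.0 / shape)


def relative_bias_and_variance(estimates, true_value):
    """Relative bias and relative variance of a vector of estimates
    (ASSUMED normalization by the true value)."""
    mean = sum(estimates) / len(estimates)
    bias = mean / true_value - 1.0
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return bias, variance / true_value**2


rng = random.Random(1)
print([uniform_frequency(rng) for _ in range(3)])
print([round(pareto_frequency(rng, shape=1.5, scale=500.0)) for _ in range(3)])
print(relative_bias_and_variance([90.0, 110.0, 100.0, 100.0], 100.0))
```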
First presented are the simulation tests results for the bias of Algorithm 3 applied for estimating cardinality of a set intersection and Algorithm 4 applied for estimating cardinality of a set difference as described herein before. Table 2 below presents the simulation tests results for the bias of Algorithm 3 (Alg. 3) and Algorithm 4 (Alg. 4) for different values of using uniformly distributed frequencies
Figure imgf000041_0002
storage units (buckets) and ) and Pareto distributed frequencies
Figure imgf000041_0001
Figure imgf000041_0003
The sampling ratio is In each table row we present the bias.
Figure imgf000041_0004
Table 2:
Figure imgf000041_0012
As evident from the results in Table 2, the measured bias values are very low and practically tend to 0, indicating insignificant bias, in agreement with the analytical analysis for the bias of Algorithms 3 and 4. For the uniform distribution, the number of distinct elements
Figure imgf000041_0005
Thus, the expected length of each original stream is
Figure imgf000041_0006
. A total storage budget of
Figure imgf000041_0007
storage units per stream, which is about 0.006% of the stream length, yields accurate estimation for both set intersection (Alg. 3) and set difference (Alg. 4) cardinalities. For the Pareto distribution, the expected length of each original stream is $500 \cdot 10^6$. Using a total storage budget of
Figure imgf000041_0008
storage units, namely, of the stream length, yields significantly accurate estimations for both set intersection and set difference cardinalities.
Now presented are the simulation tests results for the variance of Algorithms 3 and 4. Table 3 and Table 4 below present simulation tests results for both Algorithms 3 and 4 for different values of using uniform and Pareto frequency distributions. In both tables, buckets and
Figure imgf000041_0009
. The sampling ratio is
Figure imgf000041_0011
A and two values of are used,
Figure imgf000041_0010
. The results are averaged over 1,000 runs of the simulation tests and the "analysis" variance is determined according to Theorems 2 and 3.
Figure imgf000042_0001
As can be seen in Table 3 and Table 4, the algorithm variance is always lower than 20% and in most cases lower than 10%, in agreement with the results expected by the analytical analysis.
Now presented are simulation test results for simulations of Algorithm 5 used for estimating the cardinality of set expression between 0 streams where
Figure imgf000043_0001
. The simulation tests aim to confirm Theorem 4 presented to analyze and theoretically verify Algorithm 5.
The simulation tests for Algorithm 5 were conducted over three streams (sets), each with distinct elements and uniformly distributed frequencies as described herein above for the simulation of Algorithms 3 and 4. The simulation tests were conducted to estimate the cardinality of a set expression which, as known in the art, may be expressed
Figure imgf000043_0002
In the simulation test described hereinafter, we fix the cardinality of a a and estimate for different values of the intersection using
Figure imgf000043_0003
Figure imgf000043_0004
Algorithm 5. The three streams , and have the following cardinalities:
Figure imgf000043_0005
The simulation tests are conducted to estimate the cardinality
Figure imgf000043_0006
for different values of the intersection
Figure imgf000043_0007
Table 5 below presents the simulation tests results for different intersection values using uniform frequency distributions for
Figure imgf000043_0008
buckets and
Figure imgf000043_0009
. The sampling ratio is
Figure imgf000043_0010
. The results are averaged over 1,000 runs of the simulation tests, and the "analysis" variance is determined according to Theorem 4.
Table 5:
Figure imgf000044_0008
As evident from the test results in Table 5, the relative error of the variance of Algorithm 5 as measured in the simulation tests is approximately 5%, which is very similar to the variance expected by the analytical analysis. As expected, when the cardinality increases (hence the estimated cardinality increases as
Figure imgf000044_0006
Figure imgf000044_0007
well), the variance decreases.
The MTS methodology may be applied to a plurality of applications in a wide variety of domains.
For example, the MTS methodology may be used for query optimization, which may be required by database systems to determine a best (low-cost) plan for processing queries. The query optimization may be performed by a query optimizer which estimates the cost of a plan according to the input/output cardinalities of each of the plan's operators. Accurate cardinality estimation of set expressions over table fields in one scan using fixed memory, as done by the MTS based cardinality estimation for the set expressions, may be significantly valuable for such query optimizers. For example, assuming three large relational databases,
Figure imgf000044_0001
with a shared field
Figure imgf000044_0004
In case of processing a query such as
Figure imgf000044_0002
, where 13 is a stream of tuples in the field
Figure imgf000044_0005
The database system may need to determine the best (low-cost) plan for processing this query. Using the MTS sketch, the query optimizer may efficiently estimate the cardinality of the set expression
Figure imgf000044_0003
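By way of a non-limiting illustration, the following Python sketch shows how a query optimizer might estimate the cardinality of a set expression over a shared field from fixed-size summaries of three tables. It uses a conventional bottom-k (k minimum values) summary per table rather than the MTS sketch, and a hypothetical expression (T1 ∩ T2) \ T3, since the actual expression is not reproduced here. The function names hash01, bottom_k and estimate_expression, the use of SHA-1, and the value of k are assumptions made for the example only.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-uniform value in (0, 1] (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def bottom_k(items, k):
    """Summary of one table field: the k smallest distinct hash values."""
    return set(heapq.nsmallest(k, {hash01(x) for x in items}))


def estimate_expression(sketches, predicate, k):
    """Estimate the number of distinct values satisfying a set expression.

    predicate receives one boolean per table (membership of a value in that
    table) and returns True when the value belongs to the expression.
    All sketches are assumed to have been built with the same k.
    """
    # The k smallest hash values of the union are contained in the per-table summaries.
    union_bottom = heapq.nsmallest(k, set().union(*sketches))
    hits = sum(1 for v in union_bottom if predicate(*[v in s for s in sketches]))
    if len(union_bottom) < k:
        return float(hits)  # fewer than k distinct values overall: the count is exact
    union_estimate = (k - 1) / union_bottom[-1]
    # Scale the union estimate by the fraction of sampled values in the expression.
    return union_estimate * hits / k


if __name__ == "__main__":
    k = 2048
    t1 = bottom_k(range(0, 30000), k)
    t2 = bottom_k(range(10000, 40000), k)
    t3 = bottom_k(range(20000, 50000), k)
    # Hypothetical expression (T1 ∩ T2) \ T3; its true cardinality here is 10,000.
    print(estimate_expression([t1, t2, t3], lambda a, b, c: a and b and not c, k))
```

Passing the expression as a predicate over membership flags allows the same routine to serve unions, intersections and differences; the accuracy of this stand-in depends on k and on the fraction of the union covered by the expression.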
In another example, in the case of a join query, by estimating the cardinality (number of distinct tuples) of each involved table, the database system may select the best join strategy. Moreover, by estimating the cardinality of a set intersection between all involved tables, the system may estimate the size of the outcome join.
In another example, the MTS methodology may be used for network monitoring and security. Network management may require continuous measurement of multiple network parameters whose values may be efficiently estimated using the MTS sketch, using only a small portion of the monitored data. Moreover, by processing only a small portion of the traffic, real-time detection of anomalies may be feasible. For example, assuming a network where IP packets are received from different
Figure imgf000045_0004
routers. Let 0 denote the set of IP packets (flows) that enter the network
Figure imgf000045_0001
through router . Two examples of monitoring applications may be as follows:
Figure imgf000045_0002
(a) Using the MTS sketch, the total number of IP packets that enter the network can be efficiently estimated by estimating the cardinality of the set union
Figure imgf000045_0003
(b) Assume that all incoming IP packets must pass through a firewall. This may be verified by verifying that the set
Figure imgf000045_0005
is empty, where 0 is the set of IP flows that enter the firewall. This verification may be efficiently done using the MTS sketch by estimating the cardinality
Figure imgf000045_0006
and verifying that this cardinality tends to 0.
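By way of a non-limiting illustration of monitoring applications (a) and (b), the following Python sketch maintains a fixed-memory, one-pass summary per router and for the firewall, and then estimates the cardinality of the union of the router flows and of the flows that bypass the firewall. A conventional bottom-k summary is used in place of the MTS sketch, and no sampling step is shown. The class BottomKSketch, the functions estimate_union and estimate_bypassing, the flow identifiers, and the use of SHA-1 are hypothetical and serve the example only.

```python
import hashlib
import heapq


def hash01(item):
    """Map a flow identifier to a pseudo-uniform value in (0, 1] (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


class BottomKSketch:
    """Keeps the k smallest distinct hash values seen in one pass, using O(k) memory."""

    def __init__(self, k):
        self.k = k
        self._heap = []      # max-heap of kept values, stored negated
        self.values = set()  # the same values, for membership tests

    def add(self, item):
        v = hash01(item)
        if v in self.values:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -v)
            self.values.add(v)
        elif v < -self._heap[0]:
            # Replace the current largest kept value with the smaller new one.
            evicted = -heapq.heappushpop(self._heap, -v)
            self.values.discard(evicted)
            self.values.add(v)


def _union_bottom(sketches, k):
    """The k smallest hash values over the union of the summarized streams."""
    return heapq.nsmallest(k, set().union(*(s.values for s in sketches)))


def estimate_union(sketches, k):
    """(a) Estimate the number of distinct flows entering through any router."""
    bottom = _union_bottom(sketches, k)
    if len(bottom) < k:
        return float(len(bottom))
    return (k - 1) / bottom[-1]


def estimate_bypassing(router_sketches, firewall_sketch, k):
    """(b) Estimate the number of flows seen by some router but not by the firewall."""
    bottom = _union_bottom(list(router_sketches) + [firewall_sketch], k)
    hits = sum(1 for v in bottom if v not in firewall_sketch.values)
    if len(bottom) < k:
        return float(hits)
    return ((k - 1) / bottom[-1]) * hits / k
```

An estimate of zero from estimate_bypassing is consistent with all sampled flows having passed through the firewall, whereas a clearly positive estimate may indicate traffic that bypasses it. All summaries are assumed to share the same k and the same hash function.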
It is expected that during the life of a patent maturing from this application many relevant systems, methods and computer programs will be developed, and the scope of the terms cardinality estimation procedure and sampling technique is intended to include all such new technologies a priori.
As used herein the term "about" refers to
Figure imgf000045_0007
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to".
The term "consisting of" means "including and limited to".
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicate number and a second indicate number and "ranging/ranges from" a first indicate number "to" a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Claims

WHAT IS CLAIMED IS:
1. A computer implemented method of estimating a cardinality of a stream, comprising:
using at least one processor configured to execute a code, the code is adapted for:
receiving a query for estimating a cardinality of a stream comprising a plurality of elements;
obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream;
computing a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample;
computing, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream; and
computing the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
2. The computer implemented method of claim 1, wherein each of the plurality of elements includes at least one member of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet.
3. A computer implemented method of estimating a cardinality of set expressions between streams, comprising:
using at least one processor configured to execute a code, the code is adapted for:
receiving a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements;
obtaining a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream;
computing a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for the each sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the each sample;
computing, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression; and
computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
4. The computer implemented method of claim 3, wherein the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples.
5. The computer implemented method of claim 3, wherein the combination function is an intersection function to create a set intersection between the plurality of streams, the sample cardinality is created for a set intersection between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
6. The computer implemented method of claim 3, wherein the combination function is a difference function to create a set difference between the plurality of streams, the sample cardinality is created for a set difference between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
7. A system for estimating a cardinality of a stream, comprising:
at least one processor adapted to execute code, the code comprising:
code instructions to receive a query for estimating a cardinality of a stream comprising a plurality of elements;
code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream;
code instructions to compute a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample;
code instructions to compute, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream; and
code instructions to compute the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
8. A system for estimating a cardinality of set expressions between streams, comprising:
at least one processor adapted to execute code, the code comprising:
code instructions to receive a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements;
code instructions to obtain a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream;
code instructions to compute a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for the each sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the each sample;
code instructions to compute, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression; and
code instructions to compute the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
PCT/IL2017/051134 2016-10-10 2017-10-10 Mts sketch for accurate estimation of set-expression cardinalities from small samples WO2018069928A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662406019P 2016-10-10 2016-10-10
US62/406,019 2016-10-10

Publications (1)

Publication Number Publication Date
WO2018069928A1 true WO2018069928A1 (en) 2018-04-19

Family

ID=61905256

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2017/051134 WO2018069928A1 (en) 2016-10-10 2017-10-10 Mts sketch for accurate estimation of set-expression cardinalities from small samples

Country Status (1)

Country Link
WO (1) WO2018069928A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150100596A1 (en) * 2013-10-06 2015-04-09 Yahoo! Inc. System and method for performing set operations with defined sketch accuracy distribution

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11711310B2 (en) 2019-09-18 2023-07-25 Tweenznet Ltd. System and method for determining a network performance property in at least one network
US11716338B2 (en) 2019-11-26 2023-08-01 Tweenznet Ltd. System and method for determining a file-access pattern and detecting ransomware attacks in at least one computer network
CN114218292A (en) * 2021-11-08 2022-03-22 中国人民解放军国防科技大学 Multi-element time sequence similarity retrieval method
CN114218292B (en) * 2021-11-08 2022-10-11 中国人民解放军国防科技大学 Multi-element time sequence similarity retrieval method
CN117792962A (en) * 2024-02-28 2024-03-29 苏州大学 Distributed stream base measuring method, device and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17861119

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17861119

Country of ref document: EP

Kind code of ref document: A1