WO2018069928A1 - MTS sketch for accurate estimation of set-expression cardinalities from small samples

Info

Publication number: WO2018069928A1
Application number: PCT/IL2017/051134
Authority: WO
Inventors: Reuven Cohen; Liran Katzir; Aviv YEHEZKEL
Original assignee: Technion Research & Development Foundation Limited
Priority date: 2016-10-10
Filing date: 2017-10-10
Publication date: 2018-04-19

Abstract

A computer implemented method of estimating a cardinality of a stream, comprising: receiving a query for estimating a cardinality of a stream comprising a plurality of elements; obtaining a sample comprising a group of the plurality of elements randomly sampled from the stream; computing first and second data structures for the sample, used to compute an estimated sample cardinality of the sample and a ratio indicative of a proportion between the estimated sample cardinality and the estimated cardinality of the stream; and computing the estimated cardinality of the stream by applying the ratio to the estimated sample cardinality. The first data structure comprises a plurality of maximal hash values computed for the sample using a plurality of hash functions, and the second data structure comprises a fixed-size subset of the elements having a minimal hash value among the elements of the group.

Description

MTS SKETCH FOR ACCURATE ESTIMATION OF SET-EXPRESSION CARDINALITIES FROM SMALL SAMPLES

FIELD AND BACKGROUND OF THE INVENTION

The present invention, in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.

With the evolution of information technology, the amount of data that is processed and/or transferred is constantly growing, presenting major challenges to applications that need to process extremely large volumes of data, in many cases in real time.

Therefore, various methods, techniques, frameworks and/or the like are continually developed to enable such applications to process the increasing data volumes.

Such data processing methodologies may include identifying the cardinality, i.e. the number of distinct elements, of streams and/or sets comprising a plurality of elements with repetitions. The cardinality may be of major interest for multiple applications ranging from database queries to network traffic monitoring and network security applications.

SUMMARY OF THE INVENTION

According to a first aspect of the present invention there is provided a computer implemented method of estimating a cardinality of a stream, comprising using one or more processors configured to execute a code, the code is adapted for:

Receiving a query for estimating a cardinality of a stream comprising a plurality of elements.

Obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream.

Computing a first data structure and a second data structure for the sample. The first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.

Computing, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream.

Computing the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.

The MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams using only a significantly small data portion of the stream(s). By accurately estimating the cardinality for the subsample of the sampled stream (sample) of the stream as done by the MTS algorithm, the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small. Moreover, by reducing the cardinality estimation problem for estimating the cardinality of the sample to estimating the cardinality of elements appearing only once in the sample the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time while maintaining high accuracy of the estimated cardinality.

According to a second aspect of the present invention there is provided a system for estimating a cardinality of a stream, comprising one or more processors adapted to execute code, the code comprising:

Code instructions to receive a query for estimating a cardinality of a stream comprising a plurality of elements.

Code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream.

Code instructions to compute a first data structure and a second data structure for the sample. The first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.

Code instructions to compute, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream.

Code instructions to compute the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.

According to a third aspect of the present invention there is provided a computer implemented method of estimating a cardinality of set expressions between streams, comprising using one or more processors configured to execute a code, the code is adapted for:

Receiving a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements.

Obtaining a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream.

Computing a first data structure and a second data structure for each of the plurality of samples. The first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.

Computing, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.

Computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.

Since the MTS sketch is additive in nature, the MTS algorithms used for estimating the cardinality of a single stream may be easily and efficiently extended for estimating the cardinality of set expressions of the streams, in particular, a set union, a set intersection and a set difference. This may allow computing an estimated cardinality even for set expressions of multiple extremely large streams and/or sets of elements.

According to a fourth aspect of the present invention there is provided a system for estimating a cardinality of set expressions between streams, comprising one or more processors adapted to execute code, the code comprising:

Code instructions to receive a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements.

Code instructions to obtain a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream.

Code instructions to compute a first data structure and a second data structure for each of the plurality of samples. The first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.

Code instructions to compute, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.

Code instructions to compute the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.

In a further implementation form of the first, second, third and/or fourth aspects, each of the plurality of elements includes one or more members of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet. The MTS sketch based cardinality estimation may be applied to estimate the cardinality of a diverse range of streams used by multiple applications which may be of very different nature. In particular, the type of the elements of the stream(s) may vary while the same concepts of the MTS sketch based cardinality estimation may apply.

In a further implementation form of the third and/or fourth aspects, the combination function is a union function to create a set union between the plurality of streams, the first data structure comprises the plurality of maximal hash values computed for a concatenation of the plurality of samples, and the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams, specifically extremely large streams, and in particular to the set union, which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.

In a further implementation form of the third and/or fourth aspects, the combination function is an intersection function to create a set intersection between the plurality of streams, the sample cardinality is created for a set intersection between the second data structures of the plurality of samples, and the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structures of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams, specifically extremely large streams, and in particular to the set intersection, which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.

In a further implementation form of the third and/or fourth aspects, the combination function is a difference function to create a set difference between the plurality of streams, the sample cardinality is created for a set difference between the second data structures of the plurality of samples, and the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structures of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams, specifically extremely large streams, and in particular to the set difference, which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.

For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

In the drawings:

FIG. 1 is a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention;

FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention; and

FIG. 3 is a schematic illustration of a sampled stream space.

DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION

According to some embodiments of the present invention, there are provided methods, systems and computer program products for estimating a cardinality of a single stream and/or a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) using a significantly small sample of each of the streams. Each of the streams comprises a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an internet protocol (IP) packet and/or the like and the sample (sampled stream) of each stream comprises a group of elements randomly sampled from a respective stream.

Estimating the cardinality of streams as well as estimating the cardinality of set expressions between multiple streams may be useful for a plurality of applications ranging from database queries to network traffic monitoring and security applications. However, computing a precise (exact) cardinality for the streams and moreover for the set expressions of the streams may be complex and costly at best and impractical at worst as often the streams may be extremely large. The cardinality computation may therefore require high computation resources, large storage resources and may further limit real-time computation. Estimating the cardinality of a stream using only a sample (sampled stream) of the stream which comprises elements randomly selected from the stream is known in the art. However, such estimation of extremely large streams may also require excessive computation and/or storage resources rendering the estimation impractical. Moreover, such estimation may not be applicable for the set expressions between multiple large streams.

According to some embodiments of the present invention, a Maximal-Term with Sample (MTS) methodology presents an MTS sketch used by MTS based algorithms which may be used for accurately estimating the cardinality of the streams as well as the cardinality of the set expression between the plurality of streams using only a significantly small subsample of each of the samples (sampled streams) of the streams.

The cardinality of the streams and/or of the set expressions is estimated using an MTS sketch created for each of the samples. Each MTS sketch includes a first data structure and a second data structure. The first data structure comprises a vector of maximal hash values computed for the elements in the respective sample using a plurality of hash functions. The second data structure is a subsample of the respective sample and comprises a fixed-size subset of elements having the minimal hash values among the elements of the respective sample.
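The two data structures of the sketch may be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the embodiments: the names `hash01` and `build_mts_sketch`, the SHA-256 based hash family, the default sizes, and the choice to keep an occurrence count next to each subsampled element are all hypothetical.

```python
import hashlib
import heapq
from collections import Counter


def hash01(element, seed):
    """Map (seed, element) to a pseudo-uniform float in [0, 1] (assumed hash family)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_mts_sketch(sample, num_hashes=8, subsample_size=4):
    """Return (max_hashes, subsample) for a non-empty list of sampled elements."""
    counts = Counter(sample)  # repetitions do not change which elements are kept
    # First data structure: one maximal hash value per hash function.
    max_hashes = [max(hash01(e, j) for e in counts) for j in range(num_hashes)]
    # Second data structure: the subsample_size distinct elements with the
    # smallest hash values under one extra hash function (seed "sub").
    kept = heapq.nsmallest(subsample_size, counts, key=lambda e: hash01(e, "sub"))
    # Assumption: occurrence counts are stored so frequencies can be estimated later.
    subsample = {e: counts[e] for e in kept}
    return max_hashes, subsample


max_hashes, subsample = build_mts_sketch(["a", "b", "a", "c", "d", "e", "b"])
print(len(max_hashes), subsample)
```

Both structures depend only on the set of distinct elements and their counts, so the order in which the sample is read does not affect the result.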

For a single stream, an estimated sample cardinality is first computed from the first data structure, i.e. the maximal hash values of elements in the sample, using one or more max-sketch cardinality estimation techniques as known in the art, for example, the HyperLogLog algorithm and/or the like. Using the second data structure, i.e. the fixed-size subset of the sample of the stream, and applying one or more frequency estimation techniques as known in the art, for example, Good-Turing frequency estimation, a ratio value is computed which estimates the proportion between the cardinality of the elements appearing only once in the sampled stream (sample) and the cardinality of the elements appearing only once in the full (un-sampled) stream.

As the MTS sketch is additive, the MTS methodology may efficiently extend the cardinality estimation to estimate the cardinality of the set expressions between the plurality of streams, i.e. multiple streams. First, the estimated cardinality may be computed for the set union, which may be regarded as a single concatenated stream created by concatenating the plurality of streams. The same technique applied for the single stream may then apply for the concatenated stream. The MTS methodology further extends the cardinality estimation to the other set expressions, in particular, the set intersection between the plurality of streams and the set difference between the plurality of streams. The estimated cardinality of the set intersection and/or the set difference may be derived from the cardinality estimation of the set union using set theory relations between the various set expressions, in particular, the Jaccard similarity statistics (also known as intersection over union and/or the Jaccard similarity coefficient), which are known in the art.

In general, the MTS sketch and algorithms may be used to estimate the cardinality of any sequence of set expressions between any number of streams using a small sample of each of the streams.
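The single-stream estimation may be sketched in Python as follows, under several assumptions that are not taken from the text above. Instead of HyperLogLog, the sample cardinality is estimated with a simpler max-sketch estimator: for n distinct elements hashed uniformly into (0, 1), -ln(max) is exponentially distributed with rate n, so (m - 1) divided by the sum over m hash functions is an unbiased estimate of n. The Good-Turing step is assumed to estimate the probability p0 that the next stream element is absent from the sample as the fraction of occurrences belonging to elements seen exactly once, and the ratio is assumed to be applied as n_sample / (1 - p0). All function names are hypothetical.

```python
import math


def estimate_sample_cardinality(max_hashes):
    """Estimate the number of distinct elements from m >= 2 maximal hash values."""
    m = len(max_hashes)
    if m < 2:
        raise ValueError("need at least two hash functions")
    # -ln(max) ~ Exponential(n), so the sum is Gamma(m, n) and (m - 1) / sum
    # is an unbiased estimate of n.
    return (m - 1) / sum(-math.log(x) for x in max_hashes)


def unseen_mass(subsample_counts):
    """Good-Turing style estimate: singleton occurrences / all occurrences."""
    total = sum(subsample_counts.values())
    singletons = sum(1 for c in subsample_counts.values() if c == 1)
    return singletons / total


def estimate_stream_cardinality(max_hashes, subsample_counts):
    """Scale the sample estimate by the assumed ratio 1 / (1 - p0)."""
    p0 = unseen_mass(subsample_counts)
    if p0 >= 1.0:
        raise ValueError("every subsampled element is a singleton; ratio undefined")
    return estimate_sample_cardinality(max_hashes) / (1.0 - p0)


# Hypothetical inputs: 11 maximal hash values and counts for 3 subsampled elements.
print(estimate_stream_cardinality([0.99] * 11, {"a": 1, "b": 1, "c": 2}))
```

The counts are taken only over the fixed-size subsample, which is why the ratio can be computed in constant space regardless of the sample length.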

The Jaccard similarity may be computed for the plurality of streams and/or for the set expression, in particular, the set intersection and the set difference, using the MTS sketch, i.e. the first data structure and the second data structure created for the samples and/or the set expression between the samples.
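One way the subsamples could yield a Jaccard similarity, and from it the intersection and difference estimates, is sketched below in Python. This is an assumption-laden illustration and not the procedure of the embodiments: each subsample is represented as a hypothetical mapping from element to hash value, both subsamples are assumed to use the same hash function and the same size k, and the union cardinality is assumed to be already estimated elsewhere. The function name is hypothetical.

```python
import heapq


def set_expression_estimates(sub_a, sub_b, union_cardinality, k):
    """Return (jaccard, intersection, difference A minus B) estimates.

    sub_a, sub_b: dicts mapping element -> hash value, each holding the k
    elements with the smallest hash values in the respective sample.
    """
    merged = {**sub_a, **sub_b}
    # Subsample of the union: the k smallest hash values over both subsamples.
    union_sub = heapq.nsmallest(k, merged, key=merged.get)
    # An element of union_sub that belongs to a sample is necessarily in that
    # sample's own subsample, so membership can be tested on the subsamples.
    in_both = sum(1 for e in union_sub if e in sub_a and e in sub_b)
    only_a = sum(1 for e in union_sub if e in sub_a and e not in sub_b)
    jaccard = in_both / len(union_sub)
    intersection = jaccard * union_cardinality
    difference = only_a / len(union_sub) * union_cardinality
    return jaccard, intersection, difference


# Hypothetical subsamples and a hypothetical union estimate of 300.
a = {"a": 0.1, "b": 0.2, "c": 0.5}
b = {"b": 0.2, "d": 0.3, "e": 0.6}
print(set_expression_estimates(a, b, union_cardinality=300, k=3))
```

Here the fraction of the union subsample that falls in both samples plays the role of the Jaccard similarity, and multiplying it by the union estimate gives the intersection estimate.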

The MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams compared to existing methods, techniques and/or algorithms for computing and/or estimating the cardinality. Some of the existing methods may compute a precise cardinality for the stream by processing the entire un-sampled stream, i.e. analyzing each element in the stream. Such cardinality computation may require extremely high computation resources, storage resources and/or time, thus rendering the cardinality computation inefficient, costly and typically impractical for extremely large streams. Other existing methods may apply one or more algorithms to compute an estimator for computing the cardinality of a sample of the stream, i.e. a sampled stream, in order to estimate the cardinality of the stream. However, such algorithms may be sensitive to the order of the elements and/or to the repetition pattern of the elements. Moreover, in case of extremely large streams, in particular streams that need to be processed in real-time, the samples themselves may be significantly large thus requiring extensive computation and/or storage resources. Such algorithms may therefore not be suitable to real world applications in which large streams need to be processed in real time.

By accurately estimating the cardinality for the subsample of the sampled streams (samples) as done by the MTS algorithms, the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.

Moreover, as the MTS sketch is additive in nature, the MTS algorithms may be easily and efficiently extended for estimating the cardinality of set expressions of the streams. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
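The additive property may be illustrated by merging two sketches into the sketch of the concatenated samples without revisiting the samples themselves. The Python sketch below is a minimal illustration under assumptions: each sketch is a hypothetical pair (list of maximal hash values, mapping from element to hash value), both sketches are assumed to use the same hash functions, and `merge_sketches` is a hypothetical name.

```python
import heapq


def merge_sketches(sketch_a, sketch_b, k):
    """Merge two (max_hashes, subsample) sketches built with the same hash functions."""
    max_a, sub_a = sketch_a
    max_b, sub_b = sketch_b
    # The maximum over the union equals the larger of the two per-sample maxima.
    merged_max = [max(x, y) for x, y in zip(max_a, max_b)]
    # The k smallest hash values of the union are among the k smallest of each side.
    candidates = {**sub_a, **sub_b}
    kept = heapq.nsmallest(k, candidates, key=candidates.get)
    return merged_max, {e: candidates[e] for e in kept}


# Hypothetical sketches with two hash functions and subsample size k = 2.
sketch_a = ([0.9, 0.5], {"a": 0.1, "c": 0.5})
sketch_b = ([0.7, 0.8], {"b": 0.2, "d": 0.3})
print(merge_sketches(sketch_a, sketch_b, k=2))
```

The merged pair has the same shape as a single-stream sketch, so a single-stream estimator can be applied to it unchanged to estimate the set union.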

Furthermore, by reducing the cardinality estimation problem for estimating the cardinality of the sample(s) to estimating the cardinality of elements appearing only once in the sample and/or in the set expressions between the samples, the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time. However, while the cardinality estimation is significantly simplified, the accuracy of the estimation is maintained as presented hereinafter.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Referring now to the drawings, FIG. 1 illustrates a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention. An exemplary process 100 may be executed to estimate a cardinality of a stream (set) and/or of a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. The process 100 is applied to estimate the cardinality of the set expression using only a significantly small sample of each of the streams where each sample (sampled stream) comprises a group of elements randomly sampled from a respective stream.
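The random sampling of a stream may take many forms, and the text does not fix one. As a minimal Python sketch, assuming independent (Bernoulli) sampling where each element is kept with the same probability, a sample could be drawn as follows. The function name `sample_stream`, the `rate` parameter and the optional `seed` are hypothetical.

```python
import random


def sample_stream(stream, rate, seed=None):
    """Keep each element of the stream independently with probability `rate`."""
    rng = random.Random(seed)  # a seed makes the illustration reproducible
    return [element for element in stream if rng.random() < rate]


# Hypothetical stream with repetitions: 10,000 elements over 500 distinct values.
stream = [i % 500 for i in range(10_000)]
sample = sample_stream(stream, rate=0.01, seed=42)
print(len(sample), len(set(sample)))
```

Because elements repeat in the stream, the sample may contain repetitions as well, and some distinct elements of the stream may not appear in the sample at all.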

The process 100 estimates the cardinality of the single stream and/or of the set expressions using an MTS sketch created for each of the samples, where each of the MTS sketches includes a first data structure and a second data structure (subsample) computed for the respective sample. The process 100 computes an estimated sample cardinality for a single stream and/or for set expression(s) of the samples using the first data structure(s) created for the samples by estimating the cardinality of the elements appearing once in the sample(s). The estimated cardinality of the sample and/or set expression(s) of the samples may be computed using one or more cardinality estimation tools as known in the art, for example, the HyperLogLog algorithm and/or the like. The estimated sample cardinality is then combined with a computed ratio value which estimates the ratio (proportion) between the cardinality of the elements appearing only once in the sample and the cardinality of the elements appearing only once in the full stream. The ratio value is computed using the second data structure(s) and applying one or more frequency estimation techniques as known in the art, for example, the Good-Turing technique.

Reference is also made to FIG. 2, which is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention. An exemplary system 200 for executing a process such as the process 100 to estimate a cardinality of set expressions between streams (sets) comprises a computing node 201, for example, a computer, a server, a cluster of computing nodes and/or any device having one or more processors. The computing node 201 may typically include an input/output (I/O) interface 202 for obtaining a plurality of samples 220 of the plurality of streams, a processor(s) 204 and a storage 206.

The I/O interface 202 may provide one or more interconnect interfaces, for example, a network interface, a local interface and/or the like. The network interface may support one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless LAN (WLAN) (e.g. Wi-Fi), a cellular network and/or the like. The local interface may include one or more interfaces, for example, a Universal Serial Bus (USB) interface, a memory management controller (MMC) interface, a serial interface and/or the like for connecting to one or more peripheral devices, for example, a storage device and/or the like.

The processor(s) 204, homogenous or heterogeneous, may be arranged for parallel processing, as clusters and/or as one or more multi core processor(s).

The storage 206 may include one or more computer readable medium devices, either persistent storage and/or volatile memory for one or more purposes, for example, storing program code, storing data, storing intermediate computation products and/or the like. The persistent storage may include one or more persistent memory devices, for example, a Flash array, a Solid State Disk (SSD) and/or the like for storing program code. The volatile memory may also include one or more volatile memory devices, for example, a Random Access Memory (RAM) device. The storage 206 may further include one or more networked storage resources, for example, a storage server, a Network Attached Storage (NAS) and/or the like accessible through the I/O interface 202.

The processor(s) 204 may execute one or more software modules, for example, a process, an application, an agent, a utility, a script, a plug-in and/or the like, wherein a software module may comprise a plurality of program instructions stored in a non-transitory medium such as the storage 206 and executed by a processor such as the processor(s) 204. The processor(s) 204 may execute, for example, a cardinality estimator 210 for estimating the cardinality of the set expression, in particular a set union, a set intersection and a set difference between a plurality of streams each comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. In particular, the cardinality estimator 210 may estimate the cardinality of the set expression using only a significantly small sample 220 of each of the streams, obtained through the I/O interface 202 and/or from the storage 206. Optionally, the cardinality estimator 210 is executed by one or more virtual machines (VM) hosted by a computing node such as the computing node 201. Optionally, the cardinality estimator 210 is utilized as one or more remote services, for example, a remote server service, a cloud service, a Software as a Service (SaaS), a Platform as a Service (PaaS) and/or the like which are accessible over one or more networks from the computing node 201.

As shown at 102, the process 100 starts with the cardinality estimator 210 receiving a query for estimating a cardinality of a stream and/or of a set expression, in particular, a set union, a set intersection and a set difference between the plurality of streams each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.

As shown at 104, the cardinality estimator 210 obtains the sample 220 of the stream in case of the single stream and/or the samples 220 of the plurality of streams in case of the set expressions where each sample (sampled stream) 220 comprises a group of elements randomly sampled from the respective stream. The cardinality estimator 210 may obtain one or more of the samples 220 from one or more remote location, for example, a server, a cloud service, a cloud storage and/or the like which are accessible from the computing node 201 over one or more networks through the I/O interface 202. The cardinality estimator 210 may also obtain one or more of the samples 220 from the storage 206, either from a local storage and/or from a remote storage resource accessible through the I/O interface 202. For example, the cardinality estimator 210 may obtain the sample(s) 220 from a local hard drive. In another example, the cardinality estimator 210 may obtain the sample(s) 220 from a NAS and/or the like. In another example, the cardinality estimator 210 may obtain the sample(s) 220 from an attachable storage drive and/or the like.

As shown at 106, the cardinality estimator 210 computes a first data structure and a second data structure for each of the samples 220. The computation of the first data structure(s) and the second data structure(s) is described in detail herein after.

As shown at 108, using the first data structure and the second data structure, the cardinality estimator 210 computes:

(1) An estimated sample cardinality value for the sample 220 of the stream and/or of one or more of the set expressions between the samples 220. Using the first data structure, the cardinality estimator 210 may apply one or more cardinality estimation tools as known in the art, for example, the HyperLogLog algorithm, to estimate the cardinality value of the sample 220 in case of the single stream. For the set expressions, the cardinality estimator 210 may extend the cardinality estimation techniques applied to the single stream to compute the estimated sample cardinality value of a set union of the samples 220, which may be regarded as a concatenation of the samples 220. The cardinality estimator 210 may apply conventions of set theory including, for example, the Jaccard similarity for further extending the cardinality estimation to other set expressions, for example, the set intersection and/or the set difference.

(2) A ratio value estimating the ratio (proportion) between the estimated sample cardinality value of the sample 220 (single stream) and/or of the set expression of the samples 220 (set expression between multiple streams) and the estimated cardinality of the entire (un-sampled) stream and/or the set expression between the entire streams respectively. In particular, the cardinality estimator 210 reduces the ratio value computation to an estimation of the cardinality of the elements appearing only once in the second data structure. The cardinality estimator 210 may apply one or more techniques as known in the art, for example, the Good-Turing frequency estimation technique, to compute the ratio between the estimated sample cardinality value and the estimated cardinality value of the entire stream(s).

The computation of the estimated sample cardinality value and the computation of the ratio value are described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams. To estimate the cardinality of the set expression between more than two streams, the cardinality estimator 210 may apply Algorithm 5, which extends Algorithms 2, 3 and/or 4 to a plurality of streams.

As shown at 110, the cardinality estimator 210 applies the computed ratio value to the estimated cardinality computed for the sample 220 (single stream) and/or for the set expression between the samples 220 (multiple streams), for example, by multiplying the two, to compute an estimated cardinality for the entire stream and/or for the set expression between the entire streams (multiple streams). The computation of the estimated cardinality for the set expression between the streams is described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams. To estimate the cardinality of the set expression between more than two streams, the cardinality estimator 210 may apply Algorithm 5, which extends Algorithms 2, 3 and/or 4 to a plurality of streams.

Preliminaries and Basis

Before describing one or more embodiments of the present invention some existing art techniques, methodologies and/or methods for estimating the cardinality are first described, in particular the Good-Turing frequency estimation technique and the Jaccard similarity statistic (also known as intersection over a set union and/or the Jaccard similarity coefficient).

The Good-Turing frequency estimation technique is useful in many language-related tasks where the problem is to determine the probability that a word appears in a document. Let $X$ be a stream of elements possibly with repetitions, and $E$ be the set of all different elements, such that every element of $X$ belongs to $E$. Suppose that we want to estimate the probability that a randomly chosen element from the stream $X$ is $e_j \in E$. A naive approach is to choose a sample $Y$ of $l$ elements from the stream $X$, and then to set $p(e_j) = \#(e_j) / l$, where $\#(e_j)$ denotes the number of appearances of $e_j$ in the sample $Y$. However, this approach may be inaccurate, because for each element $e_j$ that does not appear even once in the sample $Y$ (i.e. an "unseen element"), $p(e_j) = 0$.

Let $E_i$ be the set of elements that appear exactly $i$ times in the sample $Y$. Good-Turing frequency estimation claims that $P_i = (i + 1) |E_{i+1}| / l$ is a consistent estimator for the probability that an element of $X$ appears $i$ times in the sample $Y$. For the case where $i = 0$, the Good-Turing technique therefore suggests that $P_0 = |E_1| / l$. In other words, the hidden mass (i.e. the estimator for the hidden elements) may be estimated using the relative frequency of the elements that appear exactly once in the sample $Y$. For example, if 1/10 of the elements in the sample $Y$ appear only once in the sample $Y$, then approximately 1/10 of the elements in $X$ are unseen elements, namely, they do not appear at all in the sample $Y$.
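As a minimal illustration of the hidden-mass estimate described above, the following Python sketch counts the elements that appear exactly once in a sample and divides that count by the sample length. The function name `good_turing_unseen_mass` and the toy sample are assumptions introduced only for this example.

```python
from collections import Counter


def good_turing_unseen_mass(sample):
    """Estimate the probability mass of unseen elements.

    Returns (number of elements appearing exactly once) / (sample length),
    which is the Good-Turing estimate for the case of zero appearances.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Toy sample (hypothetical data): "c" and "d" appear once, length is 10.
toy_sample = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "a"]
print(good_turing_unseen_mass(toy_sample))
```

In the toy sample two of the ten sampled positions hold elements that appear only once, so the estimated probability of drawing an unseen element is 2/10.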

Jaccard similarity, as known in the art, is defined as $\rho(A, B) = |A \cap B| / |A \cup B|$, where $A$ and $B$ are two finite streams (sets). The Jaccard similarity value ranges between 0, when the two streams $A$ and $B$ are completely different, and 1, when the two streams $A$ and $B$ are identical. An efficient and accurate estimate of $\rho(A, B)$ is known in the art and may be computed as follows. First, each element in the streams $A$ and $B$ is hashed into (0, 1). Then, the maximal value of each stream is taken as a sketch that represents the whole stream. As demonstrated in the art, the probability that the sketches of the streams $A$ and $B$ are equal is exactly $\rho(A, B)$. When only one hash function is used, the variance of the estimate of $\rho(A, B)$ may be infinite. Thus, $m$ hash functions may be used, and the sketch representing each of the streams is actually a vector of $m$ maximal values. As demonstrated in the art, improved performance may be attained if instead of $m$ hash functions only two hash functions with stochastic averaging are used.

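This may be stated formally as follows. Given a stream $A = \{a_1, a_2, \ldots\}$ and $m$ different hash functions $h_1, h_2, \ldots, h_m$, the maximal hash value for the $k$-th hash function is $x_A^k = \max_i h_k(a_i)$. The sketch of the stream $A$ may be therefore expressed as $X_A = \{x_A^1, x_A^2, \ldots, x_A^m\}$ and the sketch of a stream $B$ may be expressed as $X_B = \{x_B^1, x_B^2, \ldots, x_B^m\}$. The two sketches can then be used as expressed in Equation 1 below to estimate the Jaccard similarity of the streams $A$ and $B$.

Equation 1:

$\hat{\rho}(A, B) = \frac{1}{m} \sum_{k=1}^{m} I_{x_A^k = x_B^k}$

where the indicator variable $I_{x_A^k = x_B^k}$ is 1 if $x_A^k = x_B^k$ and 0 otherwise.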
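The max-hash estimation of the Jaccard similarity can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`hash01`, `max_hash_sketch`, `estimate_jaccard`) are hypothetical, and the family of hash functions is simulated by salting SHA-256 with the index of the hash function, which maps each element into [0, 1).

```python
import hashlib


def hash01(element, seed):
    """Simulated hash function number `seed`: maps an element into [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def max_hash_sketch(stream, m):
    """Vector of m maximal hash values, one per simulated hash function."""
    items = list(stream)
    return [max(hash01(e, k) for e in items) for k in range(m)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of hash functions whose maximal values coincide."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(1 for xa, xb in zip(sketch_a, sketch_b) if xa == xb)
    return matches / len(sketch_a)


# Two hypothetical streams sharing 150 of their 450 distinct elements.
stream_a = list(range(300))
stream_b = list(range(150, 450))
print(estimate_jaccard(max_hash_sketch(stream_a, 256),
                       max_hash_sketch(stream_b, 256)))
```

The true Jaccard similarity of the two example streams is 150/450; the printed estimate fluctuates around that value, with a spread that shrinks as the number of hash functions grows.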

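As known in the art, the Jaccard similarity may be generalized to set difference as expressed in Equation 2 below.

Equation 2:

$\rho(A \setminus B) = |A \setminus B| / |A \cup B|$

Thus, the estimator presented in Equation 1 may be generalized as expressed in Equation 3 below.

Equation 3:

$\hat{\rho}(A \setminus B) = \frac{1}{m} \sum_{k=1}^{m} I_{x_A^k > x_B^k}$

where the indicator variable $I_{x_A^k > x_B^k}$ is 1 if $x_A^k > x_B^k$ and 0 otherwise. A similar estimation may be performed for a set difference such as $B \setminus A$. In order to simplify the notations, the notations $\rho$, $\rho_{A \setminus B}$ and $\rho_{B \setminus A}$ are used herein after to indicate the Jaccard similarity variables $\rho(A, B)$, $\rho(A \setminus B)$ and $\rho(B \setminus A)$ respectively.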
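The generalization to set difference can be illustrated by comparing the two max-hash sketches position by position. In the Python sketch below, the comparison rule is an assumption made for illustration: a position where the first stream's maximum is strictly larger is counted towards the first-minus-second difference, a strictly smaller maximum towards the opposite difference, and equality towards the intersection. The helper names and the salted SHA-256 hashing are likewise hypothetical.

```python
import hashlib


def hash01(element, seed):
    """Simulated hash function number `seed`: maps an element into [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def max_hash_sketch(stream, m):
    """Vector of m maximal hash values, one per simulated hash function."""
    items = list(stream)
    return [max(hash01(e, k) for e in items) for k in range(m)]


def estimate_fractions(sketch_a, sketch_b):
    """Estimate each region's share of the union of the two streams."""
    m = len(sketch_a)
    a_only = sum(1 for xa, xb in zip(sketch_a, sketch_b) if xa > xb)
    b_only = sum(1 for xa, xb in zip(sketch_a, sketch_b) if xa < xb)
    return {
        "intersection": (m - a_only - b_only) / m,
        "a_minus_b": a_only / m,
        "b_minus_a": b_only / m,
    }


# Hypothetical streams: 200 elements only in the first, 100 shared,
# 100 only in the second, out of 400 distinct elements overall.
fractions = estimate_fractions(max_hash_sketch(range(300), 256),
                               max_hash_sketch(range(200, 400), 256))
print(fractions)
```

The three estimated shares always sum to 1, since each hash function contributes to exactly one of the three regions.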

MTS Based Cardinality Estimation for a Set-Expression

According to some embodiments of the present invention, the MTS methodology may be used to accurately estimate cardinality for set expressions of a plurality of streams using only a small sample of each of the streams. The set expressions, for example, a set union, a set intersection, a set difference and/or the like are created by applying one or more combination functions, for example, a union, an intersection and a difference respectively to the plurality of streams.

The MTS methodology and algorithms utilizing the MTS sketch are first presented for estimating the cardinality of a set expression of two streams and are extended to set expressions of the plurality of streams hereinafter.

Table 1 below presents some notations used herein after.

Table 1 :

Estimating the cardinality for a single stream using a generic scheme that combines a sampling process with a cardinality estimation procedure of a single stream as known in the art may consist of two steps: (a) using one or more cardinality estimators as known in the art for estimating cardinality of a sampled stream comprising samples of the original stream; and (b) estimating a sampling ratio, namely, the factor by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full original stream. Such estimation is typically based on storing a small fixed-size subsample of the sampled stream and using it to estimate the probability of unseen elements using the Good-Turing technique.

According to some embodiments of the present invention, the scheme used for estimating cardinality of the single stream may be generalized to set expressions between multiple streams. The cardinality estimation is based on maintaining an MTS sketch for each of the plurality of streams which comprises a small fixed-size subsample of the sampled stream (i.e. the sample) and using this subsample for estimating the probability of unseen elements. To this end, the MTS sketch stores two data structures for each sampled stream (sample): a first data structure and a second data structure, where the first data structure includes the maximal hash value for each hash function and the second data structure comprises a small fixed-size uniform subsample of the sample.

Reference is now made to FIG. 3, which is a schematic illustration of a sampled stream space. Illustration 300 presents a stream comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. A sample is a sampled stream of the stream which comprises elements randomly sampled from the stream. Assuming a given sampling rate, the sample includes a corresponding fraction of the elements of the stream. A subsample includes part of the sample; in particular, the subsample may include a fixed number of elements of the sample which have minimal hash values among the elements of the sample. The subsample may be generated using, for example, one-pass reservoir sampling as known in the art. Using the one-pass reservoir sampling implementation, first, the subsample is initialized with the first elements of the sample, up to the fixed subsample size, and the elements are then sorted in decreasing order of their hash values. When a new element is sampled into the sample, the hash value of the newly sampled element is compared to the current maximal hash value of the elements in the subsample. In case the hash value of the new element is smaller than the current maximal hash value of the elements in the subsample, the new element is stored in the subsample instead of the element having the maximal hash value. Otherwise, the new element is ignored. After all elements of the sample are processed, the subsample stores the elements whose hash values were minimal, and it can be considered as a uniform subsample of the fixed length.
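The one-pass maintenance of the subsample described above may be sketched in Python as follows. The names `hash01` and `min_hash_subsample`, the SHA-256 based hash and the parameter `u` (the fixed subsample size) are assumptions for illustration. Skipping an element that is already held in the subsample is also an assumption of this sketch, since repeated elements hash to the same value.

```python
import hashlib
import heapq


def hash01(element):
    """Hypothetical hash function mapping an element into [0, 1)."""
    digest = hashlib.sha256(str(element).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def min_hash_subsample(sample, u):
    """Keep, in one pass, the u distinct elements with minimal hash values."""
    heap = []        # max-heap emulated with negated hash values
    members = set()  # elements currently held in the subsample
    for element in sample:
        if element in members:
            continue  # repeated element, already stored
        h = hash01(element)
        if len(heap) < u:
            # Initialization: fill the subsample with the first u elements.
            heapq.heappush(heap, (-h, element))
            members.add(element)
        elif h < -heap[0][0]:
            # Smaller than the current maximal hash value: replace that element.
            _, evicted = heapq.heapreplace(heap, (-h, element))
            members.discard(evicted)
            members.add(element)
        # Otherwise the new element is ignored.
    return sorted(members, key=hash01)


# Hypothetical sample with repetitions: 50 distinct elements, 200 entries.
toy_sample = [f"e{i % 50}" for i in range(200)]
print(min_hash_subsample(toy_sample, 8))
```

A heap is used here instead of a fully sorted list so that the element with the current maximal hash value is found in constant time and replaced in logarithmic time.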

It should be noted that the MTS sketch is additive, i.e., the MTS sketch of a set union of a plurality of streams may be computed directly from the MTS sketches of the streams. Corollary 1 below summarizes this additivity property for two streams, which may be generalized for any number of streams.
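One plausible reading of the additivity property is sketched below in Python. It is an assumption made for illustration, not a statement of the corollary itself: the merged first data structure is taken as the element-wise maximum of the two vectors of maximal hash values, and the merged second data structure as the `u` elements with minimal hash values among the two subsamples. The names `build_sketch` and `merge_sketches`, the dictionary layout and the salted SHA-256 hashing are all hypothetical.

```python
import hashlib


def hash01(element, seed):
    """Simulated hash function `seed`: maps an element into [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(sample, m, u):
    """Build both data structures of a sketch directly from a sample."""
    distinct = set(sample)
    max_hashes = [max(hash01(e, k) for e in distinct) for k in range(m)]
    ranked = sorted((hash01(e, "sub"), e) for e in distinct)[:u]
    return {"max_hashes": max_hashes,
            "subsample": {e: h for h, e in ranked}}


def merge_sketches(sk_a, sk_b, u):
    """Combine two sketches without revisiting the underlying samples."""
    max_hashes = [max(a, b) for a, b in
                  zip(sk_a["max_hashes"], sk_b["max_hashes"])]
    pooled = {**sk_a["subsample"], **sk_b["subsample"]}
    smallest = sorted(pooled.items(), key=lambda item: item[1])[:u]
    return {"max_hashes": max_hashes, "subsample": dict(smallest)}


# Hypothetical samples with 50 shared elements.
sample_a = [f"a{i}" for i in range(200)]
sample_b = [f"x{i}" for i in range(100)] + sample_a[:50]
merged = merge_sketches(build_sketch(sample_a, 8, 16),
                        build_sketch(sample_b, 8, 16), 16)
print(merged == build_sketch(sample_a + sample_b, 8, 16))
```

Under these assumptions the merged sketch coincides with the sketch built from the concatenation of the two samples, because any element among the `u` smallest of the union is also among the `u` smallest of its own sample.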

Corollary 1 :

Assuming and are two streams (sets) with samples designated and

respectively, the MTS sketches of the streams and are:

Then, the MTS sketch of may be expressed as:

Cardinality Estimation for a Single Stream

The MTS methodology is first described for a single stream. As described herein above, estimating the cardinality for the single stream may be done by applying the Good-Turing technique to combine the sampling process with the generic cardinality estimation procedure of the single stream. The Good-Turing algorithm may receive the sampled stream (i.e. the sample) as an input and return an estimate for the cardinality of the full stream. The Good-Turing algorithm consists of two steps: (a) estimating the cardinality of the sample using any procedure for estimating the cardinality of a single stream without sampling as known in the art, the procedure being designated CAR EST PROC herein after; and (b) estimating the ratio factor by which the cardinality of the sample (sampled stream) should be multiplied in order to estimate the cardinality of the full original stream.

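To estimate the cardinality $n_s$ of the sample in step (a), CAR EST PROC is invoked using $m$ storage units. To estimate the ratio $n / n_s$ in step (b), where $n$ is the cardinality of the full stream, it is noted that the probability for unseen elements in the stream may be expressed as $P_0 = 1 - n_s / n$. Therefore, the problem of estimating $n / n_s = 1 / (1 - P_0)$ may be reduced to estimating the probability $P_0$ of unseen elements. According to the Good-Turing technique, $|E_1| / l$ is a consistent estimator for $P_0$, where $|E_1|$ is the number of elements that appear exactly once in the sample and $l$ is the length of the sample, as described herein above. Thus, identifying the number $|E_1|$ of elements that appear exactly once in the sampled stream may be sufficient for estimating the cardinality of the stream.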

To compute precisely, as known in the art, the number of elements that appear exactly once in the sample, all the elements in the sample may need to be tracked while ignoring each previously encountered element. To this end, a number of storage units which is linear in the sample size may be needed, which is therefore not scalable.
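The two-step scheme can be illustrated with an exact, storage-heavy Python sketch that tracks every sampled element, i.e. the non-scalable baseline just described rather than the sketch-based algorithm. It assumes that the unseen-element probability equals one minus the ratio between the sample cardinality and the stream cardinality, so that the multiplying factor is 1 / (1 - P0), with P0 estimated as the number of singletons divided by the sample length. The function name and the synthetic stream are hypothetical.

```python
import random
from collections import Counter


def estimate_stream_cardinality(sample):
    """Exact-counting illustration of the two-step scheme.

    Step (a): cardinality of the sample, counted exactly here
              (a cardinality estimation procedure would be used in practice).
    Step (b): ratio 1 / (1 - P0), with P0 = singletons / sample length.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    sample_cardinality = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == len(sample):
        raise ValueError("all sampled elements are unique; ratio is undefined")
    p0 = singletons / len(sample)
    return sample_cardinality / (1.0 - p0)


# Synthetic stream (assumption): 2,000 distinct elements, 50 copies each,
# sampled at a rate of 2%.
random.seed(7)
stream = [i for i in range(2000) for _ in range(50)]
sample = [x for x in stream if random.random() < 0.02]
print(len(set(stream)), round(estimate_stream_cardinality(sample)))
```

The memory used by `Counter` grows with the number of distinct sampled elements, which is exactly the cost that motivates replacing the exact count with an approximation computed from a small fixed-size subsample.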

However, the number of storage elements as well as the processing resources may be significantly reduced, thus reducing cost, complexity, time and/or the like, by reducing the estimation problem to computing an approximation of this value using the subsample of the sample according to some embodiments of the present invention. This algorithm for estimating the cardinality of a single stream using the MTS sketch may be formulated by Algorithm 1 below which utilizes Procedure 1 below for estimating the ratio factor.

Algorithm 1:

Procedure 1 :

Cardinality Estimation for a Set Union between Two Streams

According to some embodiments of the present invention, Algorithm 1 may be extended for estimating the cardinality of a set union of the two streams. Consider the samples (sampled streams) of the two streams, and let their concatenation be formed. The concatenation is actually a sample of the union of the two streams (refer to Table 1 for the notation). Thus, estimating the cardinality of the union is equivalent to estimating the cardinality of a single stream using the concatenation of the samples. Estimating the cardinality of the union using the samples may be done using Algorithm 2 below which in turn may use Algorithm 1 for processing the MTS sketch of the concatenation.

Algorithm 2:

Cardinality Estimation of a Set Intersection between Two Streams

According to some embodiments of the present invention, Algorithm 1 and Algorithm 2 may be extended for estimating the cardinality of a set intersection of the two streams $A$ and $B$. As known in the art, $|A \cap B| = \rho \cdot |A \cup B|$, where $\rho$ is the Jaccard similarity of the two full streams $A$ and $B$. Algorithm 2 may therefore be used for estimating $|A \cup B|$ while the Jaccard similarity $\rho$ for the streams $A$ and $B$ needs to be estimated. As known in the art the Jaccard similarity may be expressed as shown in Equation 4 below.

Equation 4:

Equation 5 below may be formulated according to Good-Turing (refer to Table 1 for the notations).

Equation 5:

Similar equations may be formulated to express Substituting

Equation 5 into Equation 4 may produce Equation 6 below.

Equation 6:

or equivalently

Denoting (refer to Table 1 for the notations), Equation 6

may be rewritten as expressed in Equation 7 below.

Equation 7:

Algorithm 3 below may be used for estimating the cardinality a a of the set intersection of using the samples

In algorithm 3,

may be estimated using Procedure 1. Additionally,

may also be estimated using Procedure 1 using the

. Finally, may be estimated from and

using Procedure 3 below.

Algorithm 3:

Procedure 3 :

Cardinality Estimation of a Set Difference between Two Streams

According to some embodiments of the present invention, Algorithm 1 and Algorithm 2 may be similarly extended for estimating the cardinality of a set difference of the two streams $A$ and $B$. As known in the art, $|A \setminus B| = \rho_{A \setminus B} \cdot |A \cup B|$, where $\rho_{A \setminus B} = |A \setminus B| / |A \cup B|$ according to Equation 2. Thus, Algorithm 3 may be used for estimating the cardinality of the set difference using the samples of the two streams, with the only difference being that the Jaccard similarity variable of the set difference is estimated rather than that of the set intersection.

Applying the inclusion-exclusion principle and some algebraic manipulations, the variable may be formulated as expressed in Equation 8 below.

Equation 8:

By substituting Equation 5 into Equation 8, Equation 9 may follow.

Equation 9:

Using the notations of Table 1 , where Equation 9 may be rewritten as

expressed in Equation 10 below.

Equation 10:

Algorithm 4 below which is an adjustment of Algorithm 3 may be used for estimating the cardinality a a of the set difference of using the samples

and In algorithm 4 may be estimated using Procedure 1. In addition,

may be estimated using Procedure 3.

Algorithm 4:

MTS Based Cardinality Estimation for a Set Expression between Multiple Streams

According to some embodiments of the present invention the MTS methodology, in particular Algorithm 1, Algorithm 2, Algorithm 3 and/or Algorithm 4 may be extended to estimate the cardinality of set expressions between

streams, where

. Assuming are

streams, and

are the respective samples, i.e. their respective sampled streams. The samples may be used to estimate the

cardinality of

. As presented herein above for the case of the two streams and , the sample 0 may be expressed as

021 . The cardinalities of the stream 0 and the sample

may be denoted by

respectively. Denoting as a "generalized" Jaccard similarity the "generalized" Jaccard similarity may be estimated from

in a similar way to the estimation of in Equation 1 as shown

in Equation 11 below. Equation 11:

Where the indicator variable

is 1 if, for the

hash function, satisfy the condition implied by the set expressions, and is 0

otherwise.

Using algebraic manipulations and the definition of the following expression may be obtained:

Thus, may be estimated using the following Equation:

Algorithm 5 below may be used for estimating the cardinality of the set expression between the streams with sampling using the MTS sketch methodology. Algorithm 5 consists of three steps: (a) using Equation 11 to estimate the generalized Jaccard similarity; (b) using CAR EST PROC to estimate the cardinality of the sampled stream; and (c) using Procedure 5 to estimate the factor (ratio) by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full stream.

Algorithm 5 may use Procedure 5 below for estimating

Algorithm 5:

Analytical Analysis

The correctness of the MTS methodology, in particular, the correctness of Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4 and Algorithm 5 may be verified through an analytical analysis. In order to simplify the notations, the same notation is used to denote the estimated cardinality in each of the Algorithms.

Lemma 1 is presented to describe how to compute probability distribution of a product of two normally distributed random variables whose covariance is 0.

Lemma 1 (Product distribution):

Assuming

are two random variables satisfying the condition

, and then as known in the art, the product asymptotically satisfies

the following condition:

For the analysis, the HyperLogLog algorithm as known in the art is used for the CAR EST PROC procedure in the MTS based Algorithms described herein above. The HyperLogLog estimator belongs to a family of sketches and may present improved cardinality estimation compared to other estimators known in the art. The standard error of the HyperLogLog estimator is $1.04 / \sqrt{m}$, where $m$ represents a number of storage units (e.g. registers) used for the estimation procedure. Pseudo-code of the HyperLogLog procedure is presented in Algorithm 6 below.

Algorithm 6:
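As an illustration only, and not a reproduction of Algorithm 6, the following is a minimal Python sketch of the standard HyperLogLog estimator with the usual small-range correction. The function name `hyperloglog`, the parameter `b` (so that the number of registers is 2 to the power `b`) and the use of a truncated SHA-256 digest as a 64-bit hash are assumptions of this sketch; the bias constant used is the common approximation valid for 128 registers or more.

```python
import hashlib
import math


def hyperloglog(stream, b=10):
    """Estimate the number of distinct elements using 2**b registers."""
    if b < 7:
        raise ValueError("this sketch assumes at least 128 registers")
    m = 1 << b
    registers = [0] * m
    for element in stream:
        digest = hashlib.sha256(str(element).encode()).digest()
        x = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        j = x >> (64 - b)                           # register index: first b bits
        w = x & ((1 << (64 - b)) - 1)               # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1        # position of leftmost 1-bit
        registers[j] = max(registers[j], rank)
    alpha = 0.7213 / (1 + 1.079 / m)                # bias-correction constant
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)              # small-range correction
    return raw


# Hypothetical stream: 20,000 distinct elements, each appearing twice.
stream = list(range(20000)) * 2
print(round(hyperloglog(stream, b=10)))
```

With `b=10` the sketch uses 1,024 registers, for which the standard error quoted above is roughly 3%; repeated elements do not change the registers, so the estimate depends only on the distinct elements.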

Lemma 2 below summarizes the statistical performance of Algorithm 6 without sampling, i.e., when the algorithm processes the entire stream.

Lemma 2:

For Algorithm 6, as known in the art where is the actual cardinality of

the considered set, 13 is the estimated cardinality computed using Algorithm 6, and

is the number of storage units used by Algorithm 6.

Corollary 2:

Let and be two streams. When Algorithm 6 is used with 13 storage units and without sampling, the following applies:

As presented in the art, the asymptotic bias and variance of Algorithm 1 were analyzed when using the HyperLogLog algorithm as the CAR EST PROC. It was demonstrated that the sampling rate does not affect the asymptotic unbiasedness of the estimator. The effect of the sampling rate on the estimator's variance was further analyzed with respect to the two storage sizes, namely that used by the CAR EST PROC and that used for the subsample. The following theorem summarizes the statistical performance of Algorithm 1.

Theorem 1:

As proved in the art, Algorithm 1 estimates

with mean value

and variance namely, where

In addition, as shown in the art, and satisfy the following

conditions:

where are the distinct elements in the original (un-sampled) stream 13,

and is the frequency of element 0 in stream 0.

As described herein above, estimating the set union cardinality using Algorithm 2 is equivalent to estimating the cardinality of a single stream based on its sampled stream. Thus, the statistical performance of Algorithm 2 is equal to that of Algorithm 1.

Corollary 3:

Algorithm 2 estimates

a with mean value

and variance

namely,

where 0 and 0 are as stated in Theorem 1 with respect to the union stream

The following Lemmas, i.e. Lemma 3, Lemma 4 and Lemma 5 are used herein after for the analysis of the performance of Algorithm 3 and Algorithm 4 using the MTS sketch.

Lemma 3:

As proved in the art where

is the length of the subsample

Lemma 4:

Procedure 3 estimates with mean value and variance namely,

where s the cardinality of

Lemma 4 may be proved as follows:

Procedure 3 estimates

j)_eno^_{e me} distnict elements in the union subsample as For each the probability that

belongs to may be expressed as follows:

It follows that is a sum of Bernoulli variables with success probability ¾.

Therefore, it is binomially distributed, and can be asymptotically approximated using normal distribution as

Lemma 5:

The covariance (defined similarly)

satisfies the

cardinality of

Lemma 5 may be proved as follows:

Recall that

and similarly for The dependence is between

thus Equation 12 below follows from covariance properties.

Equation 12:

The distinct elements in the union subsample may be denoted

As shown in Procedure 3, may be written as expressed in Equation 13 below.

Equation 1 3 :

where is an indicator variable that gets 1 and 0 otherwise.

Similarly may be rewritten using indicator variables that gets 1 if and 0

otherwise.

Using covariance properties and Equation 13 the covariance may be expressed as shown in Equation 14 below.

Equation 14:

The first and third equalities are due to covariance properties. The second equality is due to the independence

The fourth equality is due to Lemma 4. It should be noted that follows in the same way as the proof of Lemma

4. The last equality is obtained through algebraic manipulations. The resulting expression follows by substituting Equation 14 into Equation 12.

Theorem 2 below states the asymptotic statistical performance of Algorithm 3 used for estimating the cardinality of the set intersection as described herein above.

Theorem 2:

Algorithm 3 estimates

a with mean value and variance namely,

where satisfies the following condition:

Where

Theorem 2 may be proved as follows:

may be denoted

Similarly

may be denoted with the respective expression. Thus, the estimator in Algorithm 3 as expressed in Equation 7 may be rewritten as follows:

The asymptotic distribution of may be first analyzed. Recall that according to

Good-Turing Equation 15 below follows.

Equation 15:

Applying Lemma 1 on

the expectation may be expressed as:

The second equality follows by substituting and using Equation 15.

The variance may be expressed as

The first equality is due to the definition of The limit is because and

The last equality follows Lemma 3 and

Lemma 4. This may result in Equation 16 below.

Equation 16:

The asymptotic distribution of ¾ may be analyzed similarly. The estimator

in Algorithm 3 is now analyzed. Note that

are dependent variables.

In Lemma 5 we proved that The

expectation may therefore be expressed as:

It follows that is an unbiased estimator for . The variance may be expressed by Equation 17 below.

Equation 17:

Where and similarly for . The first equality is due to

variance properties and the second equality follows from Equation 16 and Lemma 5.

In total the estimator

is obtained, where is as stated in Equation

17. Applying Lemma 1 on the independent variables and a concludes the

proof.

Theorem 3 below states the asymptotic statistical performance of Algorithm 4 used for estimating the cardinality of the set difference as described herein above.

Theorem 3:

Algorithm 4 estimates

with mean value and variance namely,

, where

where s as stated in Theorem 2.

Lemma 6 below is used for the analysis of Algorithm 5.

Lemma 6:

In Equation 1 the estimation of is normally distributed with mean and variance

The same may apply for the estimation of and according to Equation 3, with the change of to and respectively.

Lemma 6 may be proved for where the proof for and is similar. As known in the art, for the hash function the following applies:

P

The intuition is considering the hash function

and defining

for every sample 0, as the element in the sample 13 whose hash value for

is maximum

Therefore

applies only when

lies in

. The probability for this condition is the Jaccard ratio , and therefore

From Equation and Equation 18 follows that is a sum of

3 Bernoulli variables. Therefore, it is binomially distributed, and can be asymptotically approximated to normal distribution as , namely,

Algorithm 5, used for estimating the cardinality of set expressions, in particular a set union, a set intersection and a set difference between more than two streams as described herein above, is now analyzed. Theorem 4 below states the asymptotic statistical performance of Algorithm 5.

Theorem 4:

Algorithm 5 estimates

with mean value

and variance— namely

where is the original full and un-sampled

stream is the length of the subsample stream 0 , as described in Procedure 5.

Theorem 4 may be proved as follows:

The estimator for the set expression between the 13 streams

recall that

According to Lemma 6, the term may be expressed as

According to Corollary 2, the term a may be expressed

Considering the product

then according to Lemma 1 and because the variables are independent, the following may be obtained:

Thus

Denoting

athe estimator is the final term in the estimator, i.e. is

now analyzed. According to Lemma Therefore

according to Lemma 1 and because the variables are independent, the following may be obtained:

The last equality is due to which follows from the teaching of Good-

Turing.

Simulation Tests

Simulation tests were conducted to validate the MTS methodology, in particular the Theorems presented herein above developed to analyze and prove the MTS Algorithms 1 , 2, 3, 4 and 5, in particular to validate the asymptotic bias and variance performance of the presented MTS algorithms. More specifically, the simulation was conducted to demonstrate the following:

Algorithms 3 and 4 are unbiased, as proven by Theorems 2 and 3.

The variance of Algorithms 3 and 4 is close to their analyzed variance in Theorems 2 and 3.

The variance of Algorithm 5 is close to its analyzed variance in Theorem 4.

The simulation tests were conducted with the MTS Algorithms implementing HyperLogLog as the CAR EST PROC procedure for estimating the cardinality.

The simulation tests for Algorithms 3 and 4 were conducted over two streams (sets), and , whose cardinalities are as follows:

Each distinct element appears times in the original un-sampled streams and . The frequencies are determined according to the following models known in the art:

Uniform distribution: The frequency

of the elements is uniformly distributed between 100 and 1,000; i.e.,

Pareto distribution: The frequency

of the elements follows the heavy-tailed rule with shape parameter

and scale parameter

i.e., the frequency probability function is The

scale parameter

represents the smallest possible frequency. The Pareto distribution has several unique properties. In particular, if

the Pareto distribution has infinite variance, and if

, the Pareto distribution has infinite mean. As decreases, a larger portion of the probability mass is in the tail of the distribution, and the Pareto distribution is therefore useful when a small percentage of the population controls the majority of the measured quantity.

Each of the simulation tests was repeated for 1,000 different streams (sets) and . Thus, for each of the simulated MTS Algorithms and for each value of a vector of 1,000 different estimations was produced. Then, for each value of , the variance and bias of this vector were computed and the results as presented herein after are considered as the variance and bias of the respective Algorithm for a specific value of . Each such computation is represented by one table row in Table 2, Table 3 and Table 4 below. The vector of estimations for a specific Algorithm and for a specific value of may be expressed as A mean of the vector may be expressed as 0
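The two frequency models described above can be sketched as follows. This is an illustrative sketch only: the uniform range of 100 to 1,000 is taken from the text, while the Pareto shape and scale values, the number of distinct elements, the random seed and the helper names `uniform_frequency`, `pareto_frequency` and `build_stream` are assumptions chosen to keep the example small.

```python
import random


def uniform_frequency(rng, low=100, high=1000):
    """Frequency drawn uniformly from the integers low..high (inclusive)."""
    return rng.randint(low, high)


def pareto_frequency(rng, shape, scale):
    """Frequency drawn from a Pareto distribution, truncated to an integer.

    random.paretovariate(shape) has scale 1, so the draw is multiplied by
    `scale`, which is therefore the smallest possible frequency.
    """
    return int(scale * rng.paretovariate(shape))


def build_stream(num_distinct, frequency_fn, rng):
    """Build a shuffled stream in which element i appears frequency_fn(rng) times."""
    stream = []
    for element in range(num_distinct):
        stream.extend([element] * frequency_fn(rng))
    rng.shuffle(stream)
    return stream


rng = random.Random(7)  # fixed seed so the example is repeatable
uniform_stream = build_stream(50, uniform_frequency, rng)
# Hypothetical Pareto parameters: shape 1.5 (infinite variance), scale 5.
pareto_stream = build_stream(50, lambda r: pareto_frequency(r, shape=1.5, scale=5), rng)
print(len(uniform_stream), len(pareto_stream))
```

Both streams contain exactly 50 distinct elements; only their lengths and the skew of the per-element frequencies differ, which is what the two models are meant to vary.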

The bias and variance of 0 are computed as follows:
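A minimal sketch of such a computation is given below. It assumes that the bias is measured relative to the true cardinality and that the variance is taken over the estimates normalized by the true cardinality; both normalizations, the function names and the toy numbers are assumptions made here for illustration.

```python
from statistics import mean, pvariance


def relative_bias(estimates, true_value):
    """Mean of the estimates divided by the true value, minus 1 (assumed definition)."""
    return mean(estimates) / true_value - 1.0


def relative_variance(estimates, true_value):
    """Population variance of the estimates normalized by the true value (assumed definition)."""
    return pvariance([e / true_value for e in estimates])


# Toy vector of estimations for a true cardinality of 100.
estimates = [90.0, 110.0, 100.0, 100.0]
bias = relative_bias(estimates, 100.0)
variance = relative_variance(estimates, 100.0)
print(bias, variance)
```

An unbiased algorithm yields a relative bias close to 0 as the number of repetitions grows, and the relative variance is the quantity that would be compared with the analyzed variance.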

First presented are the simulation tests results for the bias of Algorithm 3 applied for estimating cardinality of a set intersection and Algorithm 4 applied for estimating cardinality of a set difference as described herein before. Table 2 below presents the simulation tests results for the bias of Algorithm 3 (Alg. 3) and Algorithm 4 (Alg. 4) for different values of using uniformly distributed frequencies

storage units (buckets) and ) and Pareto distributed frequencies

The sampling ratio is In each table row we present the bias.

Table 2:

As evident from the results in Table 2, the measured bias values are very low and tend to 0, in agreement with the analytical analysis of the bias of Algorithms 3 and 4. For the uniform distribution, the number of distinct elements

Thus, the expected length of each original stream is

. A total storage budget of

storage units per stream, which is about 0.006% of the stream length, yields accurate estimation for both set intersection (Alg. 3) and set difference (Alg. 4) cardinalities. For the Pareto distribution, the expected length of each original stream is 500 · 10^6. Using a total storage budget of

storage units, namely, of the stream length, yields significantly accurate estimations for both set intersection and set difference cardinalities.

Now presented are the simulation tests results for the variance of Algorithms 3 and 4. Table 3 and Table 4 below present simulation tests results for both Algorithms 3 and 4 for different values of using uniform and Pareto frequency distributions. In both tables, buckets and

. The sampling ratio is

A and two values of are used,

. The results are averaged over 1,000 runs of the simulation tests and the "analysis" variance is determined according to Theorems 2 and 3.

As can be seen in Table 3 and Table 4, the algorithm variance is always lower than 20% and in most cases lower than 10%, thus complying and in excellent agreement with the results expected by the analytical analysis.

Now presented are simulation test results for simulations of Algorithm 5 used for estimating the cardinality of set expression between 0 streams where

. The simulation tests aim to confirm Theorem 4 presented to analyze and theoretically verify Algorithm 5.

The simulation tests for Algorithm 5 were conducted over three streams (sets), , and , each with distinct elements and uniformly distributed frequencies as described herein above for the simulation of Algorithms 3 and 4. The simulation tests were conducted to estimate the cardinality of a set expression a a which, as known in the art, may be expressed

In the simulation test described herein after we fix the cardinality of a a and estimate for different values of the intersection using

Algorithm 5. The three streams , and have the following cardinalities:

The simulation tests are conducted to estimate the cardinality

for different values of the intersection

Table 5 below presents the simulation tests results for different intersection values using uniform frequency distributions for

buckets and

. The sampling ratio is

. The results are averaged over 1,000 runs of the simulation tests, and the "analysis" variance is determined according to Theorem 4.

Table 5:

As evident from the test results in Table 5, the relative error of the variance of Algorithm 5 as measured in the simulation tests is approximately 5%, which is very close to the variance expected by the analytical analysis. As expected, when the cardinality increases (and hence the estimated cardinality increases as well), the variance decreases.

The MTS methodology may be applied to a plurality of applications in a wide variety of domains.

For example, the MTS methodology may be used for query optimization, which database systems may require in order to determine a best (low-cost) plan for processing queries. The query optimization may be performed by a query optimizer which estimates the cost of a plan according to the input/output cardinalities of each of the plan's operators. Accurate cardinality estimation of set expressions over table fields in one scan using fixed memory, as done by the MTS based cardinality estimation for the set expressions, may be significantly valuable for such query optimizers. For example, assuming three large relational databases,

with a shared field

In case of processing a query such as

, where 13 is a stream of tuples in the field

The database system may need to determine the best (low-cost) plan for processing this query. Using the MTS sketch, the query optimizer may efficiently estimate the cardinality of the set expression

In another example, in the case of a join query, by estimating the cardinality (number of distinct tuples) of each involved table, the database system may select the best join strategy. Moreover, by estimating the cardinality of a set intersection between all involved tables, the system may estimate the size of the outcome join.

In another example, the MTS methodology may be used for network monitoring and security. Network management may require continuous measurement of multiple network parameters whose values may be efficiently estimated using the MTS sketch, using only a small portion of the monitored data. Moreover, by processing only a small portion of the traffic, real-time detection of anomalies may be feasible. For example, assuming a network where IP packets are received from different

routers Let 0 denote the set of IP packets (flows) that enter the network

through router . Two examples of monitoring applications may be as follows:

(a) Using the MTS sketch, the total number of IP packets that enter the network can be efficiently estimated by estimating the cardinality of the set union

(b) Assume that all incoming IP packets must pass through a firewall. This may be verified by verifying that the set

is empty, where 0 is the set of IP flows that enter the firewall. This verification may be efficiently done using the MTS sketch by estimating the cardinality

a and verifying that this cardinality tends to 0.
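The firewall check in example (b) can be illustrated with a plain bottom-k (k minimum values) sketch. The following is a simplified stand-in and not the MTS sketch itself: it is applied to whole streams, omits the sampling step and the ratio correction of the MTS methodology, and keeps only the fixed-size subset of minimal hash values. The function names, the flow identifiers, the value k = 1024 and the use of BLAKE2b as the hash function are assumptions made for illustration.

```python
import hashlib

HASH_RANGE = 2 ** 64


def hash64(element: str) -> int:
    """Map an element to an integer in [1, 2**64] (assumed hash function)."""
    digest = hashlib.blake2b(element.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") + 1


def bottom_k(stream, k):
    """Keep the k smallest distinct hash values of the stream; duplicates collapse."""
    return sorted({hash64(e) for e in stream})[:k]


def estimate_difference(sketch_a, sketch_b, k):
    """Estimate |A \\ B| from the bottom-k sketches of A and B."""
    in_b = set(sketch_b)
    # The k smallest values of the merged sketches form the bottom-k sketch of A | B.
    union = sorted(set(sketch_a) | in_b)[:k]
    # A value among the k smallest of A | B that is missing from B's sketch
    # cannot belong to B, so it belongs to A \ B.
    outside_b = sum(1 for v in union if v not in in_b)
    if len(union) < k:
        return float(outside_b)  # fewer than k distinct elements: the count is exact
    union_cardinality = (k - 1) * HASH_RANGE / union[-1]
    return outside_b / k * union_cardinality


k = 1024
router_flows = [f"flow-{i}" for i in range(20000)]
firewall_all = list(router_flows)         # every flow passes through the firewall
firewall_leaky = router_flows[:18000]     # 2,000 flows bypass the firewall

clean = estimate_difference(bottom_k(router_flows, k), bottom_k(firewall_all, k), k)
leak = estimate_difference(bottom_k(router_flows, k), bottom_k(firewall_leaky, k), k)
print(clean, leak)
```

When every flow reaches the firewall the estimate is exactly 0, whereas the leaky case yields an estimate near the 2,000 bypassing flows, so a value that does not tend to 0 signals a violation.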

It is expected that during the life of a patent maturing from this application many relevant systems, methods and computer programs will be developed, and the scope of the terms cardinality estimation procedure and sampling technique is intended to include all such new technologies a priori.

As used herein the term "about" refers to

The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to".

The term "consisting of" means "including and limited to".

As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.

Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Claims

WHAT IS CLAIMED IS:

1. A computer implemented method of estimating a cardinality of a stream, comprising:

using at least one processor configured to execute a code, the code is adapted for:

receiving a query for estimating a cardinality of a stream comprising a plurality of elements;

obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream;

computing a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample;

computing, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream; and

computing the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.

2. The computer implemented method of claim 1, wherein each of the plurality of elements includes at least one member of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet.

3. A computer implemented method of estimating a cardinality of set expressions between streams, comprising:

using at least one processor configured to execute a code, the code is adapted for:

receiving a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements;

obtaining a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream;

computing a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for the each sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the each sample;

computing, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression; and

computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.

4. The computer implemented method of claim 3, wherein the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples.

5. The computer implemented method of claim 3, wherein the combination function is an intersection function to create a set intersection between the plurality of streams, the sample cardinality is created for a set intersection between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.

6. The computer implemented method of claim 3, wherein the combination function is a difference function to create a set difference between the plurality of streams, the sample cardinality is created for a set difference between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.

7. A system for estimating a cardinality of a stream, comprising:

at least one processor adapted to execute code, the code comprising:

code instructions to receive a query for estimating a cardinality of a stream comprising a plurality of elements;

code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream;

code instructions to compute a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample;

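code instructions to compute, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream; and

code instructions to compute the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.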

8. A system for estimating a cardinality of set expressions between streams, comprising:

at least one processor adapted to execute code, the code comprising:

code instructions to receive a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements;

code instructions to obtain a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream;

code instructions to compute a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for the each sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the each sample;

code instructions to compute, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression; and

code instructions to compute the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.