WO2018069928A1 - MTS sketch for accurate estimation of set-expression cardinalities from small samples - Google Patents
- Publication number
- WO2018069928A1 (PCT/IL2017/051134)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cardinality
- sample
- data structure
- stream
- value
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9014—Indexing; Data structures therefor; Storage structures hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
Definitions
- the present invention in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
- One such data processing task is identifying the cardinality, i.e. the number of distinct elements, of streams and/or sets comprising a plurality of elements with repetitions, which may be of major interest for multiple applications ranging from database queries to network traffic monitoring and network security applications.
- a computer implemented method of estimating a cardinality of a stream comprising using one or more processors configured to execute a code, the code is adapted for:
- the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions.
- the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
- the MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams using only a significantly small data portion of the stream(s).
- the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.
- the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time while maintaining high accuracy of the estimated cardinality.
- a system for estimating a cardinality of a stream comprising one or more processors adapted to execute code, the code comprising:
- code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream.
- code instructions to compute a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions.
- the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
- a computer implemented method of estimating a cardinality of set expressions between streams comprising using one or more processors configured to execute a code, the code is adapted for:
- the first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions.
- the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
- an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.
- Computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value. Since the MTS sketch is additive in nature, the MTS algorithms used for estimating the cardinality of a single stream may be easily and efficiently extended for estimating the cardinality of set expressions of the streams, in particular, a set union, a set intersection and a set difference. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
- a system for estimating a cardinality of set expressions between streams comprising one or more processors adapted to execute code, the code comprising:
- code instructions to compute a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions.
- the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
- each of the plurality of elements includes one or more members of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet.
- the MTS sketch based cardinality estimation may be applied to estimate the cardinality of a diverse range of streams used by multiple applications which may be of very different nature.
- the type of the elements of the stream(s) may vary while the same concepts of the MTS sketch based cardinality estimation may apply.
- the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples.
- the MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set union which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
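The additive behaviour of the sketch under set union can be illustrated with a minimal Python sketch. It assumes a sketch is a pair of (vector of maximal hash values, list of the elements with the smallest selection hash); the helper names `unit_hash`, `sketch`, `merge_union` and the constant `SELECT_SEED` are hypothetical and are not taken from the source. Under these assumptions, merging two per-sample sketches gives the same result as sketching the concatenation of the samples.

```python
import hashlib
import heapq

SELECT_SEED = 10**6  # hypothetical seed of the hash function used to pick the subsample


def unit_hash(element: str, seed: int) -> float:
    """Map an element to a pseudo-random value in (0, 1); `seed` selects the hash function."""
    digest = hashlib.blake2b(
        element.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    # Keep 52 bits so that (value + 0.5) / 2**52 is exact and strictly inside (0, 1).
    return ((int.from_bytes(digest, "big") >> 12) + 0.5) / 2**52


def sketch(sample, num_hashes, subsample_size):
    """Build (max_values, subsample) for a non-empty sample of strings."""
    distinct = set(sample)
    max_values = [max(unit_hash(e, k) for e in distinct) for k in range(num_hashes)]
    subsample = heapq.nsmallest(
        subsample_size, distinct, key=lambda e: unit_hash(e, SELECT_SEED)
    )
    return max_values, subsample


def merge_union(sketch_a, sketch_b, subsample_size):
    """Combine two sketches into the sketch of the concatenated samples."""
    max_a, sub_a = sketch_a
    max_b, sub_b = sketch_b
    # The maximum over a concatenation is the element-wise maximum of the two vectors.
    merged_max = [max(a, b) for a, b in zip(max_a, max_b)]
    # The smallest-hash elements of the union are always among the two subsamples.
    merged_sub = heapq.nsmallest(
        subsample_size, set(sub_a) | set(sub_b), key=lambda e: unit_hash(e, SELECT_SEED)
    )
    return merged_max, merged_sub
```

Because only the two small sketches are needed for the merge, the samples themselves do not have to be revisited when a union query arrives.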
- the combination function is an intersection function to create a set intersection between the plurality of streams
- the sample cardinality is created for a set intersection between the second data structure of the plurality of samples
- the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
- the MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set intersection which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
- the combination function is a difference function to create a set difference between the plurality of streams
- the sample cardinality is created for a set difference between the second data structure of the plurality of samples
- the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
- the MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set difference which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
- Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
- a data processor such as a computing platform for executing a plurality of instructions.
- the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data.
- a network connection is provided as well.
- a display and/or a user input device such as a keyboard or mouse are optionally provided as well.
- FIG. 1 is a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
- FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
- FIG. 3 is a schematic illustration of a sampled stream space.
DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION
- the present invention in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
- a cardinality of a single stream and/or a set expression in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) using a significantly small sample of each of the streams.
- Each of the streams comprises a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an internet protocol (IP) packet and/or the like and the sample (sampled stream) of each stream comprises a group of elements randomly sampled from a respective stream.
- Estimating the cardinality of streams as well as estimating the cardinality of set expressions between multiple streams may be useful for a plurality of applications ranging from data base queries to network traffic monitoring and security applications.
- computing a precise (exact) cardinality for the streams and moreover for the set expressions of the streams may be complex and costly at best and impractical at worst as often the streams may be extremely large.
- the cardinality computation may therefore require high computation resources, large storage resources and may further limit real-time computation.
- Estimating the cardinality of a stream using only a sample (sampled stream) of the stream which comprises elements randomly selected from the stream is known in the art.
- such estimation of extremely large streams may also require excessive computation and/or storage resources rendering the estimation impractical.
- such estimation may not be applicable for the set expressions between multiple large streams.
- a Maximal-Term with Sample (MTS) methodology presents an MTS sketch used by MTS based algorithms which may be used for accurately estimating the cardinality of the streams as well as the cardinality of the set expression between the plurality of streams using only a significantly small subsample of each of the samples (sampled streams) of the streams.
- the cardinality of the streams and/or of the set expressions is estimated using an MTS sketch created for each of the samples.
- Each MTS sketch includes a first data structure and a second data structure.
- the first data structure comprises a vector of maximal hash values computed for the elements in the respective sample using a plurality of hash functions.
- the second data structure is a subsample of the respective sample and comprises a fixed-size subset of elements having the minimal hash values among the elements of the respective sample.
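The two data structures described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `unit_hash` and `build_mts_sketch`, the BLAKE2b-based hash family, the use of one extra hash function (seed equal to `num_hashes`) to select the subsample, and the choice to keep each selected element's number of appearances in the sample are all assumptions made here.

```python
import hashlib
import heapq
from collections import Counter


def unit_hash(element: str, seed: int) -> float:
    """Map an element to a pseudo-random value in (0, 1); `seed` selects the hash function."""
    digest = hashlib.blake2b(
        element.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    # Keep 52 bits so that (value + 0.5) / 2**52 is exact and strictly inside (0, 1).
    return ((int.from_bytes(digest, "big") >> 12) + 0.5) / 2**52


def build_mts_sketch(sample, num_hashes=64, subsample_size=8):
    """Return (max_values, subsample) for a non-empty sampled stream of strings."""
    counts = Counter(sample)  # appearances of each distinct element in the sample
    # First data structure: one maximal hash value per hash function.
    max_values = [
        max(unit_hash(element, k) for element in counts) for k in range(num_hashes)
    ]
    # Second data structure: fixed-size subset of the elements with the smallest
    # hash values under one extra hash function (an assumption of this sketch).
    chosen = heapq.nsmallest(
        subsample_size, counts, key=lambda e: unit_hash(e, num_hashes)
    )
    subsample = {element: counts[element] for element in chosen}
    return max_values, subsample
```

Both structures have a size that depends only on `num_hashes` and `subsample_size`, not on the length of the sample, and repeated elements do not change the vector of maximal values.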
- an estimated sample cardinality is first computed from the first data structure, i.e. the maximal hash values of the elements in the sample, using one or more max-sketch cardinality estimation techniques as known in the art, for example, the HyperLogLog algorithm and/or the like.
- using the second data structure, i.e. the fixed-size subset of the sample of the stream, and applying one or more frequency estimation techniques as known in the art, for example, Good-Turing frequency estimation, a ratio value is computed which estimates the proportion between the cardinality of the elements appearing only once in the sampled stream (sample) and the cardinality of the elements appearing only once in the full (un-sampled) stream.
- the MTS methodology may efficiently extend the cardinality estimation to estimate the cardinality of the set expressions between the plurality of streams, i.e. multiple streams.
- the estimated cardinality may be computed for the set union, which may be regarded as a single concatenated stream created by concatenating the plurality of streams. The same technique applied for the single stream may then apply for the concatenated stream.
- the MTS methodology further extends the cardinality estimation for the other set expression, in particular, the set intersection between the plurality of streams and the set difference between the plurality of streams.
- the estimated cardinality of the set intersection and/or the set difference may be derived from the cardinality estimation of the set union using set theory conventions defining relations between the various set expressions, in particular, the Jaccard similarity statistic (also known as intersection over union and/or the Jaccard similarity coefficient), which is known in the art.
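The standard set relations that allow the intersection and the difference to be derived from the union can be written out explicitly. The symbols $A$ and $B$ for two streams and $J(A, B)$ for their Jaccard similarity are notation chosen here for illustration and may differ from the notation of the full specification.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
|A \cap B| = J(A, B)\,|A \cup B|, \qquad
|A \setminus B| = |A \cup B| - |B| = |A| - J(A, B)\,|A \cup B|.
```

Given estimates of the union cardinality and of the Jaccard similarity, the intersection follows by multiplication, and the difference follows by subtraction.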
- the MTS sketch and algorithms may be used to estimate the cardinality of any sequence of set expressions between any number of streams using a small sample of each of the streams.
- the Jaccard similarity may be computed for the plurality of streams and/or for the set expression, in particular, the set intersection and the set difference, using the MTS sketch, i.e. the first data structure and the second data structure created for the samples.
- the MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams compared to existing methods, techniques and/or algorithms for computing and/or estimating the cardinality.
- Some of the existing methods may compute a precise cardinality for the stream by processing the entire un-sampled stream, i.e. analyzing each element in the stream. Such cardinality computation may require extremely high computation resources, storage resources and/or time, thus rendering the cardinality computation inefficient, costly and typically impractical for extremely large streams.
- Other existing methods may apply one or more algorithms to compute an estimator for computing the cardinality of a sample of the stream, i.e. a sampled stream in order to estimate the cardinality of the stream.
- the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.
- the MTS algorithms may be easily and efficiently extended for estimating the cardinality of set expressions of the streams. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
- the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time.
- the accuracy of the estimation is maintained as presented hereinafter.
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- FIG. 1 illustrates a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
- An exemplary process 100 may be executed to estimate a cardinality of a stream (set) and/or of a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
- the process 100 is applied to estimate the cardinality of the set expression using only a significantly small sample of each of the streams where each sample (sampled stream) comprises a group of elements randomly sampled from a respective stream.
- the process 100 estimates the cardinality of the single stream and/or of the set expressions using an MTS sketch created for each of the samples where each of the MTS sketches includes a first data structure and a second data structure (subsample) computed for each of the samples.
- the process 100 computes an estimated sample cardinality for a single stream and/or for set expression(s) of the samples using the first data structure(s) created for the samples by estimating the cardinality of the elements appearing once in the sample(s).
- the estimated cardinality of the sample and/or set expression(s) of the samples may be computed using one or more cardinality estimation tools as known in the art, for example, HyperLogLog algorithm and/or the like.
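As a hedged illustration of how a vector of maximal hash values yields a cardinality estimate, the following Python sketch uses a simple max-sketch estimator in place of HyperLogLog. It relies on the fact that the maximum of n independent uniform values in (0, 1) satisfies -ln(max) ~ Exponential(n), so that (m - 1) divided by the sum of -ln(max) over m hash functions is an unbiased estimate of n. The function names and the BLAKE2b-based hash family are assumptions of this example.

```python
import hashlib
import math


def unit_hash(element: str, seed: int) -> float:
    """Map an element to a pseudo-random value in (0, 1); `seed` selects the hash function."""
    digest = hashlib.blake2b(
        element.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    # Keep 52 bits so that (value + 0.5) / 2**52 is exact and strictly inside (0, 1).
    return ((int.from_bytes(digest, "big") >> 12) + 0.5) / 2**52


def max_sketch(sample, num_hashes):
    """Vector of maximal hash values of a non-empty sample, one per hash function."""
    distinct = set(sample)
    return [max(unit_hash(e, k) for e in distinct) for k in range(num_hashes)]


def estimate_cardinality(max_values):
    """Estimate the number of distinct elements behind a max sketch."""
    m = len(max_values)
    if m < 2:
        raise ValueError("at least two hash functions are required")
    # Each -log(max) behaves like an Exponential(n) variable, so their sum is
    # Gamma(m, n) distributed and (m - 1) / sum is an unbiased estimate of n.
    total = sum(-math.log(v) for v in max_values)
    return (m - 1) / total
```

The relative standard error of this estimator is roughly 1 / sqrt(m - 2), so a few hundred hash functions already give an error of several percent; HyperLogLog reaches comparable accuracy with far less hashing work per element.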
- the estimated sample cardinality is then applied with a computed ratio value which estimates the ratio (proportion) between the cardinality of the elements appearing only once in the sample and the cardinality of the elements appearing only once in the full stream.
- the ratio value is computed using the second data structure(s) and
- FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention.
- An exemplary system 200 for executing a process such as the process 100 to estimate a cardinality of set expressions between streams (sets) comprises a computing node 201 for example, a computer, a server, a cluster of computing nodes and/or any device having one or more processors.
- the computing node 201 may typically include an input/output (I/O) interface 202 for obtaining a plurality of samples 220 of the plurality of streams, a processor(s) 204 and a storage 206.
- the I/O interface 202 may provide one or more interconnect interfaces, for example, a network interface, a local interface and/or the like.
- the network interface may support one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless LAN (WLAN) (e.g. Wi-Fi), a cellular network and/or the like.
- the local interface may include one or more interfaces, for example, a Universal Serial Bus (USB) interface, a memory management controller (MMC) interface, a serial interface and/or the like for connecting to one or more peripheral devices, for example a storage device and/or the like.
- the processor(s) 204 may be arranged for parallel processing, as clusters and/or as one or more multi core processor(s).
- the storage 206 may include one or more computer readable medium devices, either persistent storage and/or volatile memory for one or more purposes, for example, storing program code, storing data, storing intermediate computation products and/or the like.
- the persistent storage may include one or more persistent memory devices, for example, a Flash array, a Solid State Disk (SSD) and/or the like for storing program code.
- the volatile memory may also include one or more volatile memory devices, for example, a Random Access Memory (RAM) device.
- the storage 206 may further include one or more networked storage resources, for example, a storage server, a Network Attached Storage (NAS) and/or the like accessible through the I/O interface 202.
- the processor(s) 204 may execute one or more software modules, for example, a process, an application, an agent, a utility, a script, a plug-in and/or the like.
- a software module may comprise a plurality of program instructions stored in a non-transitory medium such as the storage 206 and executed by a processor such as the processor(s) 204.
- the processor(s) 204 may execute, for example, a cardinality estimator 210 for estimating the cardinality of the set expression, in particular a set union, a set intersection and a set difference between a plurality of streams each comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
- the cardinality estimator 210 may estimate the cardinality of the set expression using only a significantly small sample 220 of each of the streams, obtained through the I/O interface 202 and/or from the storage 206.
- the cardinality estimator 210 is executed by one or more virtual machines (VM) hosted by a computing node such as the computing node 201.
- the cardinality estimator 210 is utilized as one or more remote services, for example, a remote server service, a cloud service, a Software as a Service (SaaS), a Platform as a Service (PaaS) and/or the like which are accessible over one or more networks from the computing node 201.
- the process 100 starts with the cardinality estimator 210 receiving a query for estimating a cardinality of a stream and/or of a set expression, in particular, a set union, a set intersection and a set difference between the plurality of streams each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
- the cardinality estimator 210 obtains the sample 220 of the stream in case of the single stream and/or the samples 220 of the plurality of streams in case of the set expressions where each sample (sampled stream) 220 comprises a group of elements randomly sampled from the respective stream.
- the cardinality estimator 210 may obtain one or more of the samples 220 from one or more remote location, for example, a server, a cloud service, a cloud storage and/or the like which are accessible from the computing node 201 over one or more networks through the I/O interface 202.
- the cardinality estimator 210 may also obtain one or more of the samples 220 from the storage 206, either from a local storage and/or from a remote storage resource accessible through the I/O interface 202.
- the cardinality estimator 210 may obtain the sample(s) 220 from a local hard drive.
- the cardinality estimator 210 may obtain the sample(s) 220 from a NAS and/or the like.
- the cardinality estimator 210 may obtain the sample(s) 220 from an attachable storage drive and/or the like.
- the cardinality estimator 210 computes a first data structure and a second data structure for each of the samples 220. The computation of the first data structure(s) and the second data structure(s)
- the cardinality estimator 210 computes:
- cardinality estimator 210 may apply one or more cardinality estimation
- the cardinality estimator 210 may extend the cardinality estimation techniques applied to the single stream to compute the estimated sample cardinality value of a set union of the samples 220 which may be regarded as a concatenation of the samples 220.
- the cardinality estimator 210 may apply conventions of set theory including, for example, the Jaccard similarity for further extending the cardinality estimation for other set expressions, for example, the set intersection and/or the set difference.
- the cardinality estimator 210 reduces the ratio value computation to estimation of cardinality of elements appearing only once in the second data structure
- the cardinality estimator 210 may apply one or more frequency estimation techniques, for example, the Good-Turing frequency estimation technique, to compute the ratio between the estimated sample cardinality value and the estimated cardinality value of the entire stream(s).
- the computation of the estimated sample cardinality value and the computation of the ratio value are described in detail hereinafter, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams.
- the cardinality estimator 210 applies the computed ratio value to the estimated cardinality computed for the sample 220 (single stream) and/or for the set expression between the samples 220 (multiple streams), for example by multiplication, to compute an estimated cardinality for the entire stream and/or for the set expression between the entire streams.
- the computation of the estimated cardinality for the set expression between the streams is described in detail hereinafter, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams.
- the Good-Turing frequency estimation technique is useful in many language- related tasks where the problem is to determine the probability that a word appears in a document. Let be a stream of elements possibly with repetitions, and be the set of all different elements, such that Suppose that we want to estimate the probability that a randomly chosen element from the stream 0
- a naive approach is to choose a sample of elements from the stream 0, and then to set where denotes the number of appearances
- the hidden mass, i.e. the estimator for the hidden elements, may be estimated using the relative frequency of the elements that appear exactly once in the sample. For example, if 1/10 of the elements in the sample appear only once in the sample, then approximately 1/10 of the elements in the stream are unseen elements, namely, they do not appear at all in the sample.
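The Good-Turing estimate of the hidden mass described above can be illustrated with a short Python sketch. It is a minimal illustration only: the function name `good_turing_unseen_mass` and the toy sample are assumptions made for this example and do not appear in the original text.

```python
from collections import Counter


def good_turing_unseen_mass(sample):
    """Estimate the probability mass of elements absent from `sample`.

    Good-Turing estimate: the number of distinct elements that appear
    exactly once in the sample, divided by the sample length.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # Count distinct elements that were observed exactly once.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Hypothetical toy sample: "c" and "d" each appear once among 10 items.
sample = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "a"]
print(good_turing_unseen_mass(sample))
```

In this toy sample two of the ten sampled items are singletons, so the estimated probability of drawing an element that the sample has not seen is 2/10.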
- the Jaccard similarity value ranges between 0, when the two streams are completely different, and 1, when the two streams are identical.
- An efficient and accurate estimate of the Jaccard similarity is known in the art and may be computed as follows. First, each element in the two streams is hashed into (0, 1). Then, the maximal hash value of each stream is taken as a sketch that represents the whole stream. As demonstrated in the art, the probability that the sketches of the two streams are equal is exactly the Jaccard similarity of the two streams. When only one hash function is used, the variance of the estimate may be infinite. Thus, multiple hash functions may be used, and the sketch representing each of the streams is actually a vector of maximal values, one per hash function. As demonstrated in the art, improved performance may be attained if, instead of multiple hash functions, only two hash functions with stochastic averaging are used.
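The max-hash estimate of the Jaccard similarity described above can be sketched as follows. This is a hedged illustration, not the patent's Equation 1: it uses one independent hash function per sketch entry (not the two-hash stochastic averaging variant), and the names `hash01`, `max_hash_sketch` and `jaccard_estimate`, as well as the SHA-256 based hash, are assumptions made for this example.

```python
import hashlib


def hash01(element, seed):
    """Map (element, seed) to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def max_hash_sketch(stream, num_hashes):
    """Return, for each hash function, the maximal hash value over the stream."""
    sketch = [0.0] * num_hashes
    for x in stream:
        for j in range(num_hashes):
            h = hash01(x, j)
            if h > sketch[j]:
                sketch[j] = h
    return sketch


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of hash functions whose maximal values coincide."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


# Hypothetical streams: 500 shared elements out of 1,500 distinct ones,
# so the true Jaccard similarity is 1/3.
stream_a = list(range(0, 1000))
stream_b = list(range(500, 1500))
sk_a = max_hash_sketch(stream_a, 128)
sk_b = max_hash_sketch(stream_b, 128)
print(jaccard_estimate(sk_a, sk_b))
```

Repetitions in a stream do not change its sketch, because a repeated element always hashes to the same value. The sketch assumes non-empty streams.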
- Equation 1 below may be used to estimate the Jaccard similarity of the two streams.
- the Jaccard similarity may be generalized to set difference as expressed in Equation 2 below.
- the estimator presented in Equation 1 may be generalized as expressed in Equation 3 below.
- the estimation may be performed for a set difference such as .
- the notations and are used herein after to indicate the Jaccard similarity variables , and respectively.
- the MTS methodology may be used to accurately estimate cardinality for set expressions of a plurality of streams using only a small sample of each of the streams.
- the set expressions for example, a set union, a set intersection, a set difference and/or the like are created by applying one or more combination functions, for example, a union, an intersection and a difference respectively to the plurality of streams.
- the (MTS) methodology and algorithms utilizing the MTS sketch are first presented for estimating the cardinality of set expression of two streams and are extended to set expression of the plurality of streams hereinafter.
- Estimating the cardinality for a single stream using a generic scheme that combines a sampling process with a cardinality estimation procedure of a single stream as known in the art may consist of two steps: (a) using one or more cardinality estimators as known in the art for estimating cardinality of a sampled stream comprising samples of the original stream; and (b) estimating a sampling ratio, namely, the factor by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full original stream.
- Such estimation is typically based on storing a small fixed-size subsample of the sampled stream and using it to estimate the probability of unseen elements using the Good-Turing technique.
- the scheme used for estimating cardinality of the single stream may be generalized to set expressions between multiple streams.
- the cardinality estimation is based on maintaining an MTS sketch for each of the plurality of streams, which comprises a small fixed-size subsample of the sampled stream (i.e. the sample), and using this subsample for estimating the probability of unseen elements.
- the MTS sketch stores two data structures for each sampled stream (sample): a first data structure and a second data structure.
- Illustration 300 presents a stream comprising a plurality of elements for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
- a sample is a sampled stream of the stream, comprising elements randomly sampled from the stream at a given sampling rate. A subsample includes part of the elements of the sample.
- the subsample may include the elements of the sample which have the minimal hash values among the elements of the sample.
- the subsample may be generated using, for example, one-pass reservoir sampling as known in the art. Using the one-pass reservoir sampling implementation, the subsample is first initialized with the first elements of the sample.
- whenever the hash value of a new element is smaller than the maximal hash value currently stored in the subsample, the new element is stored in the subsample instead of the element having that maximal hash value.
- the subsample thus stores the elements whose hash values are minimal, and it can be considered a uniform subsample of fixed length.
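The one-pass construction of a fixed-size subsample with minimal hash values can be sketched as below. This is an illustration under stated assumptions, not the patent's procedure: each arriving occurrence is tagged with a hash of its position and value so that the result behaves like a uniform subsample with repetitions preserved, and the names `hash01`, `min_hash_subsample` and the `seed` parameter are hypothetical.

```python
import hashlib
import heapq


def hash01(key):
    """Map a string key to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def min_hash_subsample(sample, k, seed=0):
    """Keep, in one pass, the k occurrences with the smallest hash tags."""
    if k <= 0:
        raise ValueError("k must be positive")
    heap = []  # max-heap on the tag, stored as (-tag, index, element)
    for i, x in enumerate(sample):
        tag = hash01(f"{seed}:{i}:{x}")
        if len(heap) < k:
            # Initialization: the first k occurrences fill the subsample.
            heapq.heappush(heap, (-tag, i, x))
        elif tag < -heap[0][0]:
            # The new tag beats the largest stored tag: replace that entry.
            heapq.heapreplace(heap, (-tag, i, x))
    # Return the kept elements in their original arrival order.
    return [x for _, _, x in sorted(heap, key=lambda t: t[1])]


sample = ["a", "b", "a", "c", "d", "a", "e", "b"]
print(min_hash_subsample(sample, 3))
```

Keeping the largest stored tag at the top of a heap makes each update cost O(log k), and the memory footprint stays fixed at k entries regardless of the sample length.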
- the MTS sketch is additive, i.e., the MTS sketch of a set union of a plurality of streams may be computed directly from the MTS sketches of the streams.
- Corollary 1 summarizes this additivity property for two streams, which may be generalized for any number of streams.
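The additivity property can be illustrated with the following sketch, which merges two sketches into the sketch of the concatenated streams. It is an assumption-laden illustration rather than a transcription of Corollary 1: a sketch is modelled here as a pair (vector of maximal hash values, list of the k smallest tagged occurrences), and `hash01`, `build_sketch` and `merge_sketches` are hypothetical names.

```python
import hashlib
import heapq


def hash01(key):
    """Map a string key to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(stream_id, sample, num_hashes, k):
    """Return (maximal hash values, k occurrences with the smallest tags)."""
    max_values = [
        max((hash01(f"h{j}:{x}") for x in sample), default=0.0)
        for j in range(num_hashes)
    ]
    tagged = [(hash01(f"tag:{stream_id}:{i}"), x) for i, x in enumerate(sample)]
    return max_values, heapq.nsmallest(k, tagged)


def merge_sketches(sketch_a, sketch_b, k):
    """Sketch of the concatenation, computed from the two sketches alone."""
    max_values = [max(u, v) for u, v in zip(sketch_a[0], sketch_b[0])]
    subsample = heapq.nsmallest(k, sketch_a[1] + sketch_b[1])
    return max_values, subsample


A = ["a", "b", "c", "a", "d"]
B = ["c", "e", "e", "f"]
merged = merge_sketches(build_sketch("A", A, 4, 3), build_sketch("B", B, 4, 3), 3)
print(merged)
```

The merge works because the maximum over a concatenation is the maximum of the per-stream maxima, and the k smallest tags of the concatenation are always among the k smallest tags of each stream.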
- the MTS methodology is first described for a single stream.
- estimating the cardinality for the single stream may be done by applying the Good-Turing technique to combine the sampling process with the generic cardinality estimation procedure of the single stream.
- the Good-Turing algorithm may receive the sampled stream (i.e.
- the Good-Turing algorithm consists of two steps: (a) estimating the cardinality of the sample using any procedure known in the art for estimating the cardinality of a single stream without sampling, designated CAR EST PROC hereinafter; and (b) estimating the ratio, namely the factor by which the cardinality of the sample (sampled stream) should be multiplied in order to estimate the cardinality of the full original stream.
- step (a) CAR EST PROC is invoked using storage units
- step (b) it is noted that the probability for unseen elements in the stream may be expressed as Therefore, the problem of estimating may be reduced to estimating the probability of unseen elements. According to the Good-Turing technique, is a consistent
- the number of storage elements, as well as the processing resources, may be significantly reduced, thus reducing cost, complexity, time and/or the like, by reducing the estimation problem to computing an approximation of the value using the subsample of the sample according to some embodiments of the present invention.
- This algorithm for estimating the cardinality of a single stream using the MTS sketch may be formulated by algorithm 1 below which utilizes procedure 1 below for estimating
- Algorithm 1 may be extended for estimating the cardinality of a set union of two streams, given the samples (sampled streams) of the two streams.
- Algorithm 2 which in turn may use Algorithm 1 for processing the MTS sketch of the concatenation 13 .
- algorithm 1 and algorithm 2 may be extended for estimating the cardinality of a set intersection of the two streams and .
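- the cardinality of the set intersection of the two streams equals the cardinality of their set union multiplied by the Jaccard similarity of the two full streams.
- Algorithm 2 may therefore be used for estimating the cardinality of the set union, while the Jaccard similarity of the two streams needs to be estimated.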
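The relation used here follows directly from the definition of the Jaccard similarity. The symbols below ($A$ and $B$ for the two full streams, $\rho$ for their Jaccard similarity, hats for estimates) are notation assumed for this illustration, since the original symbols are not available in this text:

$$\rho = \frac{|A \cap B|}{|A \cup B|} \quad\Longrightarrow\quad |A \cap B| = \rho \cdot |A \cup B|, \qquad \widehat{|A \cap B|} = \hat{\rho} \cdot \widehat{|A \cup B|}.$$

For example, if the estimated union cardinality is 1,500 and the estimated Jaccard similarity is 1/3, the estimated intersection cardinality is 500.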
- the Jaccard similarity may be expressed as shown in Equation 4 below.
- Equation 5 may be formulated according to Good-Turing (refer to Table 1 for the notations).
- Equation 5 into Equation 4 may produce Equation 6 below.
- Algorithm 3 may be used for estimating the cardinality a a of the set intersection of using the samples In algorithm 3, may be estimated using Procedure 1. Additionally, may also be estimated using Procedure 1 using the . Finally, may be estimated from and
- algorithm 1 and algorithm 2 may be similarly extended for estimating the cardinality of a set difference of the two streams and .
- a a where 0 according to Equation 2.
- Algorithm 3 may be used for estimating the cardinality of the set difference using the samples of the two streams, with the only difference being that the Jaccard similarity variable of the set difference is estimated rather than that of the set intersection.
- Equation 9 may follow.
- Equation 9 may be rewritten as
- Algorithm 4 which is an adjustment of Algorithm 3 may be used for estimating the cardinality a a of the set difference of using the samples and In algorithm 4 may be estimated using Procedure 1.
- the MTS methodology in particular Algorithm 1, Algorithm 2, Algorithm 3 and/or Algorithm 4 may be extended to estimate the cardinality of set expressions between streams, where . Assuming are streams, and are the respective samples, i.e. their respective sampled streams. The samples may be used to estimate the
- the sample 0 may be expressed as
- indicator variable is 1 if, for the hash function, satisfy the condition implied by the set expressions, and is 0
- Algorithm 5 may be used for estimating the cardinality of the set
- Algorithm 5 consists of three steps: (a) using Equation 11 to estimate ; (b) using CAR EST PROC to estimate and (c) using Procedure 5 to estimate— , the factor (ratio) by which the cardinality 12 of the sampled stream 13 should be multiplied in order to estimate the cardinality 13 of the full stream 13.
- Algorithm 5 may use Procedure 5 below for estimating
- the correctness of the MTS methodology in particular, the correctness of Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4 and Algorithm 5 may be verified through an analytical analysis.
- Lemma 1 is presented to describe how to compute probability distribution of a product of two normally distributed random variables whose covariance is 0.
- HyperLogLog estimator belongs to a family of sketches and is may present
- the standard error of the HyperLogLog estimator is approximately $1.04/\sqrt{m}$, where $m$ represents the number of storage units (e.g. registers) used for the estimation procedure.
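As a quick numeric illustration, the commonly cited HyperLogLog standard error of roughly 1.04/sqrt(m) can be tabulated for a few register counts. The function name `hll_standard_error` and the chosen values of m are assumptions made for this example.

```python
import math


def hll_standard_error(m):
    """Approximate relative standard error of HyperLogLog with m registers."""
    if m <= 0:
        raise ValueError("m must be positive")
    return 1.04 / math.sqrt(m)


for m in (64, 1024, 16384):
    print(m, hll_standard_error(m))
```

Quadrupling the number of registers halves the standard error, which is why the storage budget directly bounds the attainable accuracy.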
- Pseudo-code of the HyperLogLog procedure is presented in Algorithm 6 below. Algorithm 6:
- Lemma 2 summarizes the statistical performance of Algorithm 6 without sampling, i.e., when the algorithm processes the entire stream.
- the considered set, 13 is the estimated cardinality computed using Algorithm 6, and is the number of storage units used by Algorithm 6.
- Algorithm 1 estimates with mean value and variance namely, where
- Algorithm 2 estimates a with mean value and variance
- Lemmas i.e. Lemma 3, Lemma 4 and Lemma 5 are used herein after for the analysis of the performance of Algorithm 3 and Algorithm 4 using the MTS sketch.
- Procedure 3 estimates with mean value and variance namely,
- Lemma 4 may be proved as follows:
- Procedure 3 estimates j) eno ⁇ e me distnict elements in the union subsample as For each the probability that
- Lemma 5 may be proved as follows:
- Equation 12 follows from covariance properties.
- Equation 13 As shown in Procedure 3, may be written as expressed in Equation 13 below.
- Equation 14 the covariance may be expressed as shown in Equation 14 below.
- Theorem 2 below states the asymptotic statistical performance of Algorithm 3 used for estimating the cardinality of the set intersection as described herein above.
- Algorithm 3 estimates a with mean value and variance namely,
- Theorem 2 may be proved as follows: may be denoted Similarly may be denoted with the respective expression.
- the estimator in Algorithm 3 as expressed in Equation 7 may be rewritten as follows:
- the asymptotic distribution of may be first analyzed. Recall that according to
- the variance may be expressed as
- Theorem 3 below states the asymptotic statistical performance of Algorithm 4 used for estimating the cardinality of the set difference as described herein above.
- Algorithm 4 estimates with mean value and variance namely,
- Lemma 6 below is used for the analysis.
- Equation 1 the estimation of is normally distributed with mean and variance
- Lemma 6 may be proved for where the proof for and is similar. As known in the art, for the hash function the following applies:
- Equation and Equation 18 follows that is a sum of 3 Bernoulli variables. Therefore, it is binomially distributed, and can be asymptotically approximated to normal distribution as , namely,
- Algorithm 5 is used for estimating the cardinality of set expressions, in particular a set union, a set intersection and a set difference, between multiple streams as described herein above.
- Theorem 4 below states the asymptotic statistical performance of Algorithm 5.
- Algorithm 5 estimates with mean value and variance— namely
- Theorem 4 may be proved as follows:
- Simulation tests were conducted to validate the MTS methodology, in particular the Theorems presented herein above for analyzing MTS Algorithms 1, 2, 3, 4 and 5, and specifically to validate the asymptotic bias and variance performance of the presented MTS algorithms. More specifically, the simulation was conducted to demonstrate the following:
- Algorithms 3 and 4 are unbiased, as proven by Theorems 2 and 3.
- The variance of Algorithm 5 is close to its analyzed variance in Theorem 4.
- Uniform distribution: the frequency of the elements is uniformly distributed between 100 and 1,000.
- the Pareto distribution has several unique properties. In particular, if the shape parameter satisfies $\alpha \le 2$, the Pareto distribution has infinite variance, and if $\alpha \le 1$, the Pareto distribution has infinite mean. As $\alpha$ decreases, a larger portion of the probability mass is in the tail of the distribution, and the Pareto distribution is therefore useful when a small percentage of the population controls the majority of the measured quantity.
- Each of the simulation tests was repeated for 1,000 different pairs of streams (sets). Thus, for each of the simulated MTS Algorithms and for each parameter value, a vector of 1,000 different estimations was produced.
- the variance and bias of this vector were computed and the results as presented herein after are considered as the variance and bias of the respective Algorithm for a specific value of .
- Each such computation is represented by one table row in Table 2, Table 3 and Table 4 below.
- the vector of estimations for a specific Algorithm and for a specific value of may be expressed as A mean of the vector may be expressed as 0
- the bias and variance of 0 are computed as follows:
- the sampling ratio is In each table row we present the bias.
- the measured bias values are very low and practically tend to 0, in agreement with the analysis of the bias of Algorithms 3 and 4.
- the expected length of each original stream is .
- a total storage budget of storage units per stream which is about 0.006% of the stream length, yields accurate estimation for both set intersection (Alg. 3) and set difference (Alg. 4) cardinalities.
- the expected length of each original stream is $500 \cdot 10^{6}$.
- Using a total storage budget of storage units, namely, of the stream length yields significantly accurate estimations for both set intersection and set difference cardinalities.
- Table 3 and Table 4 below present simulation tests results for both Algorithms 3 and 4 for different values of using uniform and Pareto frequency distributions.
- the sampling ratio is A and two values of are used, .
- the results are averaged over 1,000 runs of the simulation tests and the "analysis" variance is determined according to Theorems 2 and 3.
- the simulation tests aim to confirm Theorem 4 presented to analyze and theoretically verify Algorithm 5.
- the simulation tests for Algorithm 5 were conducted over three streams (sets), each with distinct elements and uniformly distributed frequencies as described herein above for the simulation of Algorithms 3 and 4.
- the simulation tests were conducted to estimate the cardinality of a set expression a a which, as known in the art, may be expressed
- Table 5 presents the simulation tests results for different intersection values using uniform frequency distributions for buckets and .
- the sampling ratio is .
- the results are averaged over 1,000 runs of the simulation tests, and the "analysis" variance is determined according to Theorem 4.
- the MTS methodology may be applied to a plurality of applications in a wide variety of domains.
- the MTS methodology may be used for query optimization, which database systems may require in order to determine the best (low-cost) plan for processing queries.
- the query optimization may be processed by a query optimizer which estimates the cost of a plan according to the input/output cardinalities of each plan's operator.
- Accurate cardinality estimation of set expressions over table fields in one scan using fixed memory, as done by the MTS based cardinality estimation for set expressions, may be significantly valuable for such query optimizers. For example, consider three large relational databases with a shared field. When processing a query over a set expression of the streams of tuples in that field, the database system may need to determine the best (low-cost) plan for processing this query.
- the query optimizer may efficiently estimate the cardinality of the set expression
- the database system may select the best join strategy.
- the system may estimate the size of the outcome join.
- the MTS methodology may be used for network monitoring and security. Network management may require continuous measurement of multiple network parameters whose values may be efficiently estimated using the MTS sketch, using only a small portion of the monitored data.
- real-time detection of anomalies may be feasible. For example, assuming a network where ⁇ packets are received from different
- Two examples of monitoring applications may be as follows:
- cardinality estimation procedure and sampling technique are intended to include all such new technologies a priori.
- a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
- range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Abstract
A computer implemented method of estimating a cardinality of a stream, comprising: receiving a query for estimating a cardinality of a stream comprising a plurality of elements, obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream, computing first and second data structures for the sample, used to compute an estimated sample cardinality of the sample and a ratio indicative of a proportion between the estimated sample cardinality and the estimated cardinality of the stream, and computing the estimated cardinality of the stream by applying the ratio to the estimated sample cardinality, wherein the first data structure comprises a plurality of maximal hash values computed for the sample using a plurality of hash functions and the second data structure comprises a fixed-size subset of the elements having a minimal hash value among the elements of the group.
Description
MTS SKETCH FOR ACCURATE ESTIMATION OF SET-EXPRESSION CARDINALITIES FROM SMALL SAMPLES
FIELD AND BACKGROUND OF THE INVENTION
The present invention, in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
With the evolution of information technology, the amount of data that is processed and/or transferred is constantly growing, presenting major challenges to multiple applications that may need to process extremely large volumes of data, in many cases in real-time.
Therefore, various methods, techniques, frameworks and/or the like are continually developed to support and enable such applications to process the increasing data volumes.
One or more of such data processing methodologies may include identifying the cardinality, i.e. the number of distinct elements in streams and/or sets comprising a plurality of elements with repetitions, which may be of major interest for multiple applications ranging from database queries to network traffic monitoring and network security applications.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention there is provided a computer implemented method of estimating a cardinality of a stream, comprising using one or more processors configured to execute a code, the code is adapted for:
Receiving a query for estimating a cardinality of a stream comprising a plurality of elements.
Obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream.
Computing a first data structure and a second data structure for the sample. The first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
Computing, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream.
Computing the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
The MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams using only a significantly small data portion of the stream(s). By accurately estimating the cardinality for the subsample of the sampled stream (sample) of the stream as done by the MTS algorithm, the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small. Moreover, by reducing the cardinality estimation problem for estimating the cardinality of the sample to estimating the cardinality of elements appearing only once in the sample the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time while maintaining high accuracy of the estimated cardinality.
According to a second aspect of the present invention there is provided a system for estimating a cardinality of a stream, comprising one or more processors adapted to execute code, the code comprising:
Code instructions to receive a query for estimating a cardinality of a stream comprising a plurality of elements.
Code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream.
Code instructions to compute a first data structure and a second data structure for the sample. The first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample.
Code instructions to compute, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream.
Code instructions to compute the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
According to a third aspect of the present invention there is provided a computer implemented method of estimating a cardinality of set expressions between streams, comprising using one or more processors configured to execute a code, the code is adapted for:
Receiving a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements.
Obtaining a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream.
Computing a first data structure and a second data structure for each of the plurality of samples. The first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
Computing, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.
Computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
Since the MTS sketch is additive in nature, the MTS algorithms used for estimating the cardinality of a single stream may be easily and efficiently extended for estimating the cardinality of set expressions of the streams, in particular, a set union, a set intersection and a set difference. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
According to a fourth aspect of the present invention there is provided a system for estimating a cardinality of set expressions between streams, comprising one or more processors adapted to execute code, the code comprising:
Code instructions to receive a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements.
Code instructions to obtain a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream.
Code instructions to compute a first data structure and a second data structure for each of the plurality of samples. The first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions. The second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample.
Code instructions to compute, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression.
Code instructions to compute the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
In a further implementation form of the first, second, third and/or fourth aspects, each of the plurality of elements includes one or more members of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet. The MTS sketch based cardinality estimation may be applied to estimate the cardinality of a diverse range of streams used by multiple applications which may be of very different nature. In particular, the type of the elements of the stream(s) may vary while the same concepts of the MTS sketch based cardinality estimation may apply.
In a further implementation form of the third and/or fourth aspects, the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set union which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
In a further implementation form of the third and/or fourth aspects, the combination function is an intersection function to create a set intersection between the plurality of streams, the sample cardinality is created for a set intersection between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams specifically extremely large streams, in particular the set intersection which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
In a further implementation form of the third and/or fourth aspects, the combination function is a difference function to create a set difference between the plurality of streams, the sample cardinality is created for a set difference between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples. The MTS sketch based cardinality estimation may be applied to a plurality of set expressions between multiple streams, specifically extremely large streams, in particular the set difference, which may be of major interest in a plurality of applications ranging from database query processing to network traffic monitoring and network security applications.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention;
FIG. 2 is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention; and
FIG. 3 is a schematic illustration of a sampled stream space.
DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION
The present invention, in some embodiments thereof, relates to estimating a cardinality of a single stream and/or set expressions between multiple streams and, more particularly, but not exclusively, to estimating a cardinality of a single stream and/or set expressions between multiple streams using a significantly small sample of each of the streams.
According to some embodiments of the present invention, there are provided methods, systems and computer program products for estimating a cardinality of a single stream and/or a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) using a significantly small sample of each of the streams. Each of the streams comprises a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an internet protocol (IP) packet and/or the like and the sample (sampled stream) of each stream comprises a group of elements randomly sampled from a respective stream.
Estimating the cardinality of streams as well as estimating the cardinality of set expressions between multiple streams may be useful for a plurality of applications ranging from database queries to network traffic monitoring and security applications. However, computing a precise (exact) cardinality for the streams and moreover for the set expressions of the streams may be complex and costly at best and impractical at worst as often the streams may be extremely large. The cardinality computation may therefore require high computation resources, large storage resources and may further limit real-time computation. Estimating the cardinality of a stream using only a sample (sampled stream) of the stream which comprises elements randomly selected from the stream is known in the art. However, such estimation of extremely large streams may also require excessive computation and/or storage resources rendering the estimation impractical. Moreover, such estimation may not be applicable for the set expressions between multiple large streams.
According to some embodiments of the present invention, a Maximal-Term with Sample (MTS) methodology presents an MTS sketch used by MTS based algorithms which may be used for accurately estimating the cardinality of the streams as well as the cardinality of the set expression between the plurality of streams using only a significantly small subsample of each of the samples (sampled streams) of the streams.
The cardinality of the streams and/or of the set expressions is estimated using an MTS sketch created for each of the samples. Each MTS sketch includes a first data structure and a second data structure. The first data structure comprises a vector of maximal hash values computed for the elements in the respective sample using a plurality of hash functions. The second data structure is a subsample of the respective sample and comprises a fixed-size subset of elements having the minimal maximal hash values among the elements of the respective sample.
For a single stream, an estimated sample cardinality is first computed for the first data structure, i.e. the maximal hash values of elements in the sample, using one or more max-sketch cardinality estimation techniques as known in the art, for example, the HyperLogLog algorithm and/or the like. Using the second data structure, i.e. the fixed-size subset of the sample of the stream, and applying one or more frequency estimation techniques as known in the art, for example, Good-Turing frequency estimation, a ratio value is computed which estimates the proportion between the cardinality of the elements appearing only once in the sampled stream (sample) and the cardinality of the elements appearing only once in the full (un-sampled) stream.
As the MTS sketch is additive, the MTS methodology may efficiently extend the cardinality estimation to estimate the cardinality of the set expressions between the plurality of streams, i.e. multiple streams. First, the estimated cardinality may be computed for the set union which may be regarded as a single concatenated stream created by concatenating the plurality of streams. The same technique applied for the single stream may then apply for the concatenated stream. The MTS methodology further extends the cardinality estimation to the other set expressions, in particular, the set intersection between the plurality of streams and the set difference between the plurality of streams. The estimated cardinality of the set intersection and/or the set difference may be derived from the cardinality estimation of the set union using set theory conventions defining relations between the various set expressions, in particular, the Jaccard similarity statistic (also known as intersection over a set union and/or the Jaccard similarity coefficient) which is known in the art. In general, the MTS sketch and algorithms may be used to estimate the cardinality of any sequence of set expressions between any number of streams using a small sample of each of the streams.
The Jaccard similarity may be computed for the plurality of streams and/or for the set expression, in particular, the set intersection and the set difference using the MTS sketch, i.e. the first data structure and the second data structure created for the samples and/or the set expression between the samples.
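The derivation of the set intersection and set difference cardinalities from the set union cardinality follows directly from the definition of the Jaccard similarity. The relations below are a short worked illustration; the symbols $A$, $B$ and $\rho$ are notation assumed here for two streams and their Jaccard similarity, and the numeric values are invented for the example only.

```latex
\begin{align*}
\rho(A,B) = \frac{|A \cap B|}{|A \cup B|}
  \quad &\Longrightarrow \quad |A \cap B| = \rho(A,B)\,|A \cup B| \\
\rho(A \setminus B) = \frac{|A \setminus B|}{|A \cup B|}
  \quad &\Longrightarrow \quad |A \setminus B| = \rho(A \setminus B)\,|A \cup B| \\
|A \cup B| &= |A \cap B| + |A \setminus B| + |B \setminus A| \\[4pt]
\text{Example: } |A \cup B| \approx 900,\ \rho(A,B) \approx \tfrac{1}{3}
  \quad &\Longrightarrow \quad |A \cap B| \approx 300
\end{align*}
```

The third relation is a useful sanity check: the three Jaccard-type ratios for the intersection and the two differences sum to 1.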
The MTS based cardinality estimation may present significant advantages for computing an estimated cardinality for extremely large streams and moreover for set expressions of multiple streams compared to existing methods, techniques and/or algorithms for computing and/or estimating the cardinality. Some of the existing methods may compute a precise cardinality for the stream by processing the entire un-sampled stream, i.e. analyzing each element in the stream. Such cardinality computation may require extremely high computation resources, storage resources and/or time thus rendering the cardinality computation inefficient, costly and typically impractical for extremely large streams. Other existing methods may apply one or more algorithms to compute an estimator for computing the cardinality of a sample of the stream, i.e. a sampled stream, in order to estimate the cardinality of the stream. However, such algorithms may be sensitive to the order of the elements and/or to the repetition pattern of the elements. Moreover, in case of extremely large streams, in particular streams that need to be processed in real-time, the samples themselves may be significantly large thus requiring extensive computation and/or storage resources. Such algorithms may therefore not be suitable for real world applications in which large streams need to be processed in real time.
By accurately estimating the cardinality for the subsample of the sampled streams (samples) as done by the MTS algorithms, the computation resources, the storage resources and/or the processing time may be significantly reduced since the subsample is fixed in size and is significantly small.
Moreover, as the MTS sketch is additive in nature, the MTS algorithms may be easily and efficiently extended for estimating the cardinality of set expressions of the streams. This may allow computing an estimated cardinality even for set expression of multiple extremely large streams and/or sets of elements.
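The additive property can be illustrated for the vector of maximal hash values: the sketch of a concatenation of samples equals the element-wise maximum of the individual sketches. The following is a minimal Python sketch under stated assumptions. The names `salted_hash`, `max_sketch` and `merge_sketches` are hypothetical, salted BLAKE2b stands in for the unspecified hash functions, and only the first data structure is shown.

```python
import hashlib


def salted_hash(element, j):
    """Hypothetical j-th hash function: a 64-bit integer from salted BLAKE2b."""
    data = f"{j}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def max_sketch(sample, m):
    """Vector of m maximal hash values, one per hash function."""
    return [max(salted_hash(e, j) for e in sample) for j in range(m)]


def merge_sketches(sketch_a, sketch_b):
    """Sketch of the concatenated samples: the element-wise maximum."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    return [max(a, b) for a, b in zip(sketch_a, sketch_b)]


# Illustrative samples (invented values); repetitions are allowed.
sample_a = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]
sample_b = ["10.0.0.3", "10.0.0.4", "10.0.0.5"]

merged = merge_sketches(max_sketch(sample_a, 16), max_sketch(sample_b, 16))
direct = max_sketch(sample_a + sample_b, 16)  # sketch of the concatenation
```

Because `merged` and `direct` coincide, the sketches of two samples can be combined without revisiting the samples themselves.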
Furthermore, by reducing the cardinality estimation problem for estimating the cardinality of the sample(s) to estimating the cardinality of elements appearing only once in the sample and/or in the set expressions between the samples, the cardinality estimation may be significantly simplified thus further reducing computation resources, storage resources and/or processing time. However, while the cardinality estimation is significantly simplified, the accuracy of the estimation is maintained as presented herein after.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Referring now to the drawings, FIG. 1 illustrates a flowchart of an exemplary process of estimating a cardinality of set expressions between streams, according to some embodiments of the present invention. An exemplary process 100 may be executed to estimate a cardinality of a stream (set) and/or of a set expression, in particular, a set union, a set intersection and a set difference between a plurality of streams (sets) each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. The process 100 is applied to estimate the cardinality of the set expression using only a significantly small sample of each of the streams where each sample (sampled stream) comprises a group of elements randomly sampled from a respective stream.
The process 100 estimates the cardinality of the single stream and/or of the set expressions using an MTS sketch created for each of the samples where each of the MTS sketches includes a first data structure and a second data structure (subsample) computed for each of the samples. The process 100 computes an estimated sample cardinality for a single stream and/or for set expression(s) of the samples using the first data structure(s) created for the samples by estimating the cardinality of the elements appearing once in the sample(s). The estimated cardinality of the sample and/or set expression(s) of the samples may be computed using one or more cardinality estimation tools as known in the art, for example, the HyperLogLog algorithm and/or the like. The estimated sample cardinality is then applied with a computed ratio value which estimates the ratio (proportion) between the cardinality of the elements appearing only once in the sample and the cardinality of the elements appearing only once in the full stream. The ratio value is computed using the second data structure(s) and applying one or more frequency estimation techniques as known in the art, for example, the Good-Turing technique.
Reference is also made to FIG. 2, which is a schematic illustration of an exemplary system for estimating a cardinality of set expressions between streams, according to some embodiments of the present invention. An exemplary system 200 for executing a process such as the process 100 to estimate a cardinality of set expressions between streams (sets) comprises a computing node 201 for example, a computer, a server, a cluster of computing nodes and/or any device having one or more processors.
The computing node 201 may typically include an input/output (I/O) interface 202 for obtaining a plurality of samples 220 of the plurality of streams, a processor(s) 204 and a storage 206.
The I/O interface 202 may provide one or more interconnect interfaces, for example, a network interface, a local interface and/or the like. The network interface may support one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a Local Area Network (LAN), a wide Area Network (WAN), a Wireless LAN (WLAN) (e.g. Wi-Fi), a cellular network and/or the like. The local interface may include one or more interfaces, for example, a Universal Serial Bus (USB) interface, a memory management controller (MMC) interface, a serial interface and/or the like for connecting to one or more peripheral devices, for example a storage device and/or the like.
The processor(s) 204, homogenous or heterogeneous, may be arranged for parallel processing, as clusters and/or as one or more multi core processor(s).
The storage 206 may include one or more computer readable medium devices, either persistent storage and/or volatile memory for one or more purposes, for example, storing program code, storing data, storing intermediate computation products and/or the like. The persistent storage may include one or more persistent memory devices, for example, a Flash array, a Solid State Disk (SSD) and/or the like for storing program code. The volatile memory may also include one or more volatile memory devices, for example, a Random Access Memory (RAM) device. The storage 206 may further include one or more networked storage resources, for example, a storage server, a Network Attached Storage (NAS) and/or the like accessible through the I/O interface 202.
The processor(s) 204 may execute one or more software modules, for example, a process, an application, an agent, a utility, a script, a plug-in and/or the like, wherein a software module comprises a plurality of program instructions stored in a non-transitory medium such as the storage 206 and executed by a processor such as the processor(s) 204. The processor(s) 204 may execute, for example, a cardinality estimator 210 for estimating the cardinality of the set expression, in particular a set union, a set intersection and a set difference between a plurality of streams each comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. In particular, the cardinality estimator 210 may estimate the cardinality of the set expression using only a significantly small sample 220 of each of the streams obtained through the I/O interface 202 and/or from the storage 206. Optionally, the cardinality estimator 210 is executed by one or more virtual machines (VM) hosted by a computing node such as the computing node 201. Optionally, the cardinality estimator 210 is utilized as one or more remote services, for example, a remote server service, a cloud service, a Software as a Service (SaaS), a Platform as a Service (PaaS) and/or the like which are accessible over one or more networks from the computing node 201.
As shown at 102, the process 100 starts with the cardinality estimator 210 receiving a query for estimating a cardinality of a stream and/or of a set expression, in particular, a set union, a set intersection and a set difference between the plurality of streams each comprising a plurality of elements possibly with repetitions, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like.
As shown at 104, the cardinality estimator 210 obtains the sample 220 of the stream in case of the single stream and/or the samples 220 of the plurality of streams in case of the set expressions where each sample (sampled stream) 220 comprises a group of elements randomly sampled from the respective stream. The cardinality estimator 210 may obtain one or more of the samples 220 from one or more remote locations, for example, a server, a cloud service, a cloud storage and/or the like which are accessible from the computing node 201 over one or more networks through the I/O interface 202. The cardinality estimator 210 may also obtain one or more of the samples 220 from the storage 206, either from a local storage and/or from a remote storage resource accessible through the I/O interface 202. For example, the cardinality estimator 210 may obtain the sample(s) 220 from a local hard drive. In another example, the cardinality estimator 210 may obtain the sample(s) 220 from a NAS and/or the like. In another example, the cardinality estimator 210 may obtain the sample(s) 220 from an attachable storage drive and/or the like.
As shown at 106, the cardinality estimator 210 computes a first data structure and a second data structure for each of the samples 220. The computation of the first data structure(s) and the second data structure(s)
(1) An estimated sample cardinality value for the sample 220 of the stream and/or of one or more of the set expressions between the samples 220. Using the first data structure, the cardinality estimator 210 may apply one or more cardinality estimation tools as known in the art, for example, the HyperLogLog algorithm, to estimate the cardinality value of the sample 220 in case of the single stream. For the set expressions, the cardinality estimator 210 may extend the cardinality estimation techniques applied to the single stream to compute the estimated sample cardinality value of a set union of the samples 220 which may be regarded as a concatenation of the samples 220. The cardinality estimator 210 may apply conventions of set theory including, for example, the Jaccard similarity for further extending the cardinality estimation to other set expressions, for example, the set intersection and/or the set difference.
(2) A ratio value estimating the ratio (proportion) between the estimated sample cardinality value of the sample 220 (single stream) and/or of the set expression of the samples 220 (set expression between multiple streams) and the estimated cardinality of the entire (un-sampled) stream and/or the set expression between the entire streams respectively. In particular, the cardinality estimator 210 reduces the ratio value computation to an estimation of the cardinality of elements appearing only once in the second data structure. The cardinality estimator 210 may apply one or more techniques as known in the art, for example, the Good-Turing frequency estimation technique, to compute the ratio between the estimated sample cardinality value and the estimated cardinality value of the entire stream(s).
The computation of the estimated sample cardinality value and the computation of the ratio value are described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams. To estimate the cardinality of the set expression between more than two streams, the cardinality estimator 210 may apply Algorithm 5 extending Algorithms 2, 3 and/or 4 to the plurality of streams.
As shown at 110, the cardinality estimator 210 applies the computed ratio value to the estimated cardinality computed for the sample 220 (single stream) and/or for the set expression between the samples 220 (multiple streams), for example, by multiplying the two values, to compute an estimated cardinality for the entire stream and/or for the set expression between the entire streams (multiple streams). The computation of the estimated cardinality for the set expression between the streams is described in detail herein after, specifically, Algorithm 1 for the single stream, Algorithm 2 for the set union between two streams, Algorithm 3 for the set intersection between two streams and Algorithm 4 for the set difference between two streams. To estimate the cardinality of the set expression between more than two streams, the cardinality estimator 210 may apply Algorithm 5 extending Algorithms 2, 3 and/or 4 to the plurality of streams.
Preliminaries and Basis
Before describing one or more embodiments of the present invention some existing art techniques, methodologies and/or methods for estimating the cardinality are first described, in particular the Good-Turing frequency estimation technique and the Jaccard similarity statistic (also known as intersection over a set union and/or the Jaccard similarity coefficient).
The Good-Turing frequency estimation technique is useful in many language-related tasks where the problem is to determine the probability that a word appears in a document. Let $X = \{x_1, x_2, \ldots, x_s\}$ be a stream of elements possibly with repetitions, and $E = \{e_1, e_2, \ldots, e_n\}$ be the set of all different elements, such that $x_i \in E$. Suppose that we want to estimate the probability that a randomly chosen element from the stream $X$ is $e_j$. A naive approach is to choose a sample $Y = \{y_1, y_2, \ldots, y_l\}$ of $l$ elements from the stream $X$, and then to set $\hat{P}(e_j) = \#(e_j)/l$, where $\#(e_j)$ denotes the number of appearances of $e_j$ in the sample $Y$. However, this approach may be inaccurate, because for each element $e_j$ that does not appear even once in the sample $Y$ (i.e. an "unseen element"), $\hat{P}(e_j) = 0$.
Let $E_i$ be the set of elements that appear exactly $i$ times in the sample $Y$. Good-Turing frequency estimation claims that $\hat{P}_i = (i+1)\,|E_{i+1}|/l$ is a consistent estimator for the probability that an element of $X$ appears $i$ times in the sample $Y$. For the case where $i = 0$, the Good-Turing technique therefore suggests that $\hat{P}_0 = |E_1|/l$, namely, the probability of unseen elements may be estimated using the relative frequency of the elements that appear exactly once in the sample $Y$. For example, if 1/10 of the elements in the sample $Y$ appear only once in the sample $Y$, then approximately 1/10 of the elements in $X$ are unseen elements, namely, they do not appear at all in the sample $Y$.
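The unseen-element estimate described above reduces to counting the elements that appear exactly once in the sample and dividing by the sample length. The following is a minimal Python sketch of that computation; the function name `good_turing_unseen_probability` and the sample values are hypothetical and serve only as an illustration.

```python
from collections import Counter


def good_turing_unseen_probability(sample):
    """Good-Turing estimate of the unseen probability mass.

    Returns (number of elements appearing exactly once) / (sample length).
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Ten sampled elements: "a" x6, "b" x2, "c" x1, "d" x1.
sample = ["a", "b", "a", "c", "a", "d", "b", "a", "a", "a"]
p_unseen = good_turing_unseen_probability(sample)  # 2 singletons / 10 elements
```

Here two of the ten sampled elements are singletons, so roughly one fifth of the stream is estimated to consist of elements absent from the sample.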
The Jaccard similarity is defined as $\rho(A, B) = |A \cap B| / |A \cup B|$, where $A$ and $B$ are two finite streams (sets). The Jaccard similarity value ranges between 0, when the two streams $A$ and $B$ are completely different, and 1, when the two streams $A$ and $B$ are identical. An efficient and accurate estimate of $\rho(A, B)$ is known in the art and may be computed as follows. First, each element in the streams $A$ and $B$ is hashed into (0, 1). Then, the maximal value of each stream is taken as a sketch that represents the whole stream. As demonstrated in the art, the probability that the sketches of the streams $A$ and $B$ are equal is exactly $\rho(A, B)$. When only one hash function is used, the variance of the estimate of $\rho(A, B)$ may be infinite. Thus, $m$ hash functions may be used, and the sketch representing each of the streams is actually a vector of $m$ maximal values. As demonstrated in the art, improved performance may be attained if instead of $m$ hash functions only two hash functions with stochastic averaging are used.
This may be stated formally as follows. Given a stream $A = \{a_1, a_2, \ldots, a_p\}$ and $m$ different hash functions $h_1, h_2, \ldots, h_m$, the maximal hash value for the $j$-th hash function may be expressed as $x_A^j = \max_{1 \le i \le p} h_j(a_i)$. The sketch of the stream $A$ may be therefore expressed as $X_A = \{x_A^1, x_A^2, \ldots, x_A^m\}$ and the sketch of the stream $B$ may be expressed as $X_B = \{x_B^1, x_B^2, \ldots, x_B^m\}$. The two sketches may be used in Equation 1 below to estimate the Jaccard similarity of the streams $A$ and $B$.
Equation 1:
$\hat{\rho}(A, B) = \frac{1}{m} \sum_{j=1}^{m} I_{x_A^j = x_B^j}$
where the indicator variable $I_{x_A^j = x_B^j}$ is 1 if $x_A^j = x_B^j$ and 0 otherwise.
As known in the art, the Jaccard similarity may be generalized to set difference as expressed in Equation 2 below.
Equation 2:
$\rho(A \setminus B) = |A \setminus B| / |A \cup B|$
Thus, the estimator presented in Equation 1 may be generalized as expressed in Equation 3 below.
Equation 3:
$\hat{\rho}(A \setminus B) = \frac{1}{m} \sum_{j=1}^{m} I_{x_A^j > x_B^j}$
A similar estimation may be performed for a set difference such as $B \setminus A$. In order to simplify the notations, the notations $\rho$, $\rho_{A \setminus B}$ and $\rho_{B \setminus A}$ are used herein after to indicate the Jaccard similarity variables $\rho(A, B)$, $\rho(A \setminus B)$ and $\rho(B \setminus A)$ respectively.
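The max-hash estimators of the Jaccard similarity and of its set-difference generalization can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented algorithms: `salted_hash`, `max_sketch` and `jaccard_estimates` are hypothetical names, salted BLAKE2b stands in for the unspecified hash functions, and hash values are kept as 64-bit integers rather than scaled into (0, 1), which does not change any comparison.

```python
import hashlib


def salted_hash(element, j):
    """Hypothetical j-th hash function: a 64-bit integer from salted BLAKE2b."""
    data = f"{j}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def max_sketch(stream, m):
    """Vector of m maximal hash values, one per hash function."""
    return [max(salted_hash(e, j) for e in stream) for j in range(m)]


def jaccard_estimates(sketch_a, sketch_b):
    """Return estimates of rho(A, B), rho(A \\ B) and rho(B \\ A).

    Equal maxima indicate that the maximum of the union lies in the
    intersection; a strictly larger maximum for A indicates that it lies
    in A but not in B, and symmetrically for B.
    """
    m = len(sketch_a)
    equal = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    a_larger = sum(1 for a, b in zip(sketch_a, sketch_b) if a > b)
    b_larger = m - equal - a_larger
    return equal / m, a_larger / m, b_larger / m


# Two overlapping streams: |A ∩ B| = 300 and |A ∪ B| = 900, so each true ratio is 1/3.
stream_a = list(range(0, 600))
stream_b = list(range(300, 900))
m = 256
rho, rho_a_minus_b, rho_b_minus_a = jaccard_estimates(
    max_sketch(stream_a, m), max_sketch(stream_b, m)
)
```

The three estimates always sum to 1, since every hash function falls into exactly one of the three cases; their accuracy improves as the number of hash functions grows.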
MTS Based Cardinality Estimation for a Set-Expression
According to some embodiments of the present invention, the MTS methodology may be used to accurately estimate cardinality for set expressions of a plurality of streams using only a small sample of each of the streams. The set expressions, for example, a set union, a set intersection, a set difference and/or the like are created by applying one or more combination functions, for example, a union, an intersection and a difference respectively to the plurality of streams.
The MTS methodology and algorithms utilizing the MTS sketch are first presented for estimating the cardinality of a set expression of two streams and are extended to a set expression of the plurality of streams hereinafter.
Table 1 below presents some notations used herein after.
Table 1 :
Estimating the cardinality for a single stream using a generic scheme that combines a sampling process with a cardinality estimation procedure of a single stream as known in the art may consist of two steps: (a) using one or more cardinality estimators as known in the art for estimating cardinality of a sampled stream comprising samples of the original stream; and (b) estimating a sampling ratio, namely, the factor by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full original stream. Such estimation is typically based on storing a small fixed-size subsample of the sampled stream and using it to estimate the probability of unseen elements using the Good-Turing technique.
According to some embodiments of the present invention, the scheme used for estimating cardinality of the single stream may be generalized to set expressions between multiple streams. The cardinality estimation is based on maintaining an MTS sketch for each of the plurality of streams, which comprises a small fixed-size subsample of the sampled stream (i.e. the sample), and using this subsample for estimating the probability of unseen elements. To this end, the MTS sketch stores two data structures for each sampled stream (sample): a first data structure and a second data structure, where the first data structure includes the maximal hash value of the sample for each hash function and the second data structure includes the fixed-size subsample.
Reference is now made to FIG. 3, which is a schematic illustration of a sampled stream space. Illustration 300 presents a stream comprising a plurality of elements, for example, a table tuple, a database tuple, a word, a symbol, a binary representation, a numeral expression, an IP packet and/or the like. A sample is a sampled stream of the stream, which comprises elements randomly sampled from the stream at a given sampling rate, such that the sample includes a corresponding fraction of the elements of the stream. A subsample includes part of the sample; in particular, the subsample may include a fixed number of elements of the sample which have the minimal hash values among the elements of the sample. The subsample may be generated using, for example, one-pass reservoir sampling as known in the art.
Using the one-pass reservoir sampling implementation, the subsample is first initialized with the first elements of the sample, up to the fixed subsample size, and these elements are then sorted in decreasing order of their hash values. When a new element is sampled into the sample, the hash value of the newly sampled element is compared to the current maximal hash value of the elements in the subsample. In case the hash value of the new element is smaller than the current maximal hash value of the elements in the subsample, the new element is stored in the subsample instead of the element having the maximal hash value. Otherwise, the new element is ignored. After all elements of the sample are processed, the subsample stores the elements whose hash values were minimal, and it can be considered as a uniform subsample of the fixed length.
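The one-pass subsample maintenance described above may be sketched as follows. This is a hedged illustration: a max-heap replaces the sorted list mentioned in the text, repeated elements are kept as separate entries, and the hash function and the name `min_hash_subsample` are assumptions made for this example.

```python
import hashlib
import heapq

def unit_hash(element):
    """Map an element to a float in the unit interval (illustrative hash choice)."""
    digest = hashlib.blake2b(str(element).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def min_hash_subsample(sample, size):
    """Keep, in one pass, the `size` entries of `sample` with the smallest hash values."""
    heap = []  # max-heap emulated with negated hash values: (-hash, position, element)
    for position, element in enumerate(sample):
        value = unit_hash(element)
        if len(heap) < size:
            # Initialization phase: the subsample is not full yet.
            heapq.heappush(heap, (-value, position, element))
        elif value < -heap[0][0]:
            # The new hash is smaller than the current maximum: replace that entry.
            heapq.heapreplace(heap, (-value, position, element))
        # Otherwise the new element is ignored.
    # Return the stored elements in increasing order of hash value.
    return [element for _, _, element in sorted(heap, reverse=True)]
```

Each incoming element costs one hash evaluation and at most one logarithmic-time heap operation, and the memory used is bounded by the fixed subsample size regardless of the sample length.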
It should be noted that the MTS sketch is additive, i.e., the MTS sketch of a set union of a plurality of streams may be computed directly from the MTS sketches of the streams. Corollary 1 below summarizes this additivity property for two streams, which may be generalized for any number of streams.
Corollary 1 :
Assuming and are two streams (sets) with samples designated and
respectively, the MTS sketches of the streams and are:
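The additivity property may be illustrated with a small sketch. The data layout used here is hypothetical: each sketch is assumed to be a pair consisting of a list of per-hash-function maxima and a list of `(hash_value, element)` pairs for the subsample; the function name `merge_mts_sketches` is likewise an assumption.

```python
def merge_mts_sketches(sketch_a, sketch_b, size):
    """Combine two sketches into the sketch of the concatenation of their samples.

    Each sketch is a pair (max_hashes, subsample), where `max_hashes` holds one
    maximal hash value per hash function and `subsample` holds (hash_value, element)
    pairs for the entries with the smallest hash values (assumed layout).
    """
    max_a, sub_a = sketch_a
    max_b, sub_b = sketch_b
    # The maximum over the concatenation is the larger of the two maxima.
    merged_max = [max(x, y) for x, y in zip(max_a, max_b)]
    # The `size` smallest hashes of the concatenation are among the two subsamples.
    merged_sub = sorted(sub_a + sub_b)[:size]
    return merged_max, merged_sub
```

Because both parts of the merge only need the two sketches, the union sketch can be obtained without revisiting either sample.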
Cardinality Estimation for a Single Stream
The MTS methodology is first described for a single stream. As described hereinabove, estimating the cardinality for the single stream may be done by applying the Good-Turing technique to combine the sampling process with the generic cardinality estimation procedure of the single stream. The Good-Turing algorithm may receive the sampled stream (i.e. the sample) as an input and return an estimate for the cardinality of the full stream. The Good-Turing algorithm consists of two steps: (a) estimating a cardinality of the sample using any procedure for estimating the cardinality of a single stream without sampling as known in the art, the procedure is designated CAR EST PROC hereinafter; and (b) estimating the ratio factor by which the cardinality of the sample (sampled stream) should be multiplied in order to estimate the cardinality of the full original stream.
To estimate the cardinality of the sample in step (a), CAR EST PROC is invoked using a fixed number of storage units. To estimate the ratio factor in step (b), it is noted that the ratio factor may be expressed through the probability of unseen elements in the stream. Therefore, the problem of estimating the ratio factor may be reduced to estimating the probability of unseen elements. According to the Good-Turing technique, the fraction of the sample made up of elements that appear exactly once in the sample is a consistent estimator for the probability of unseen elements. Thus, identifying the number of elements that appear exactly once in the sampled stream may be sufficient for estimating the cardinality of the stream. To compute this number precisely as known in the art, all the elements in the sample may need to be tracked while ignoring each previously encountered element. To this end, a number of storage units that is linear in the sample size may be needed, which is therefore not scalable.
However, the number of storage units as well as the processing resources may be significantly reduced, thus reducing cost, complexity, time and/or the like, by reducing the estimation problem to computing an approximation of this number using the subsample of the sample according to some embodiments of the present invention. This algorithm for estimating the cardinality of a single stream using the MTS sketch may be formulated by Algorithm 1 below, which utilizes Procedure 1 below.
Algorithm 1:
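The two-step scheme described in this section may be sketched as follows. This is not the listing of Algorithm 1; it is a minimal illustration under stated assumptions: the Good-Turing estimate of the unseen probability is taken as the share of subsample entries whose element occurs exactly once, and the scaling rule "sample cardinality divided by one minus the unseen probability" is assumed here. The function names are hypothetical.

```python
from collections import Counter

def good_turing_unseen_probability(subsample):
    """Good-Turing estimate: share of entries whose element occurs exactly once."""
    counts = Counter(subsample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(subsample)

def scale_up_cardinality(sample_cardinality, unseen_probability):
    """Assumed scaling rule: stream cardinality ~= sample cardinality / (1 - P_unseen)."""
    if unseen_probability >= 1.0:
        raise ValueError("every entry is a singleton; the ratio cannot be estimated")
    return sample_cardinality / (1.0 - unseen_probability)

if __name__ == "__main__":
    subsample = ["a", "b", "b", "c", "c", "c", "d"]
    p_unseen = good_turing_unseen_probability(subsample)  # two singletons out of seven entries
    # 4.0 stands in for the output of a cardinality estimator run on the sample.
    print(scale_up_cardinality(4.0, p_unseen))
```

In practice the first argument of `scale_up_cardinality` would come from a sketch-based estimator such as HyperLogLog rather than from an exact count.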
Cardinality Estimation for a Set Union between Two Streams
According to some embodiments of the present invention, Algorithm 1 may be extended for estimating the cardinality of a set union of the two streams $A$ and $B$. Let the samples (sampled streams) of the streams $A$ and $B$ be given, and let their concatenation be formed. The concatenation is actually a sample of the set union of the two streams (refer to Table 1 for the notation). Thus, estimating the cardinality of the set union is equivalent to estimating the cardinality of a single stream using the concatenation of the samples. Estimating the cardinality of the set union using the samples may be done using Algorithm 2 below, which in turn may use Algorithm 1 for processing the MTS sketch of the concatenation.
Algorithm 2:
Cardinality Estimation of a Set Intersection between Two Streams
According to some embodiments of the present invention, Algorithm 1 and Algorithm 2 may be extended for estimating the cardinality of a set intersection of the two streams $A$ and $B$. As known in the art, $|A \cap B| = \rho(A, B) \cdot |A \cup B|$, where $\rho(A, B)$ is the Jaccard similarity of the two full streams $A$ and $B$. Algorithm 2 may therefore be used for estimating $|A \cup B|$, while the Jaccard similarity of the streams $A$ and $B$ needs to be estimated. As known in the art, the Jaccard similarity may be expressed as shown in Equation 4 below.
Equation 5 below may be formulated according to Good-Turing (refer to Table 1 for the notations).
Substituting Equation 5 into Equation 4 may produce Equation 6 below.
may be rewritten as expressed in Equation 7 below.
Algorithm 3 below may be used for estimating the cardinality $|A \cap B|$ of the set intersection of the streams $A$ and $B$ using their samples.
In algorithm 3,
may be estimated using Procedure 1. Additionally,
may also be estimated using Procedure 1 using the
. Finally, may be estimated from and
using Procedure 3 below.
Algorithm 3:
Procedure 3 :
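The identity on which this section rests, namely that the intersection cardinality equals the Jaccard similarity multiplied by the union cardinality, may be checked on exact sets with the short sketch below. It only illustrates the final multiplication step; in the estimation setting both factors would be estimates, and the function names are assumptions.

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def intersection_from_union(jaccard_value, union_cardinality):
    """Intersection cardinality as Jaccard similarity times union cardinality."""
    return jaccard_value * union_cardinality

if __name__ == "__main__":
    a = range(0, 60)
    b = range(40, 100)
    union_cardinality = len(set(a) | set(b))
    # The exact intersection of these ranges has 20 elements.
    print(intersection_from_union(jaccard(a, b), union_cardinality))
```

Since the result is a product of two estimates, errors in either factor carry over multiplicatively to the intersection estimate.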
Cardinality Estimation of a Set Difference between Two Streams
According to some embodiments of the present invention, Algorithm 1 and Algorithm 2 may be similarly extended for estimating the cardinality of a set difference of the two streams $A$ and $B$. As known in the art, $|A \setminus B|$ equals the set-difference Jaccard similarity variable of Equation 2 multiplied by $|A \cup B|$. Thus, Algorithm 3 may be used for estimating the cardinality $|A \setminus B|$ of the set difference using the samples of the two streams, with the only difference being that the set-difference Jaccard similarity variable is estimated rather than the Jaccard similarity $\rho(A, B)$.
Applying the inclusion-exclusion principle and some algebraic manipulations, the variable may be formulated as expressed in Equation 8 below.
Equation 8:
expressed in Equation 10 below.
Algorithm 4 below, which is an adjustment of Algorithm 3, may be used for estimating the cardinality $|A \setminus B|$ of the set difference of the streams $A$ and $B$ using their samples. As in Algorithm 3, Procedure 1 may be used for the estimation in Algorithm 4.
MTS Based Cardinality Estimation for a Set Expression between Multiple Streams
According to some embodiments of the present invention, the MTS methodology, in particular Algorithm 1, Algorithm 2, Algorithm 3 and/or Algorithm 4, may be extended to estimate the cardinality of set expressions between more than two streams. Given a plurality of streams and their respective samples, i.e. their respective sampled streams, the samples may be used to estimate the cardinality of a set expression between the streams. As presented hereinabove for the case of the two streams $A$ and $B$, the sample of the set union may be expressed as the concatenation of the respective samples.
A "generalized" Jaccard similarity may be defined for the set expression and may be estimated from Equation 11 below, where the indicator variable is 1 if, for the respective hash function, the maximal hash values of the streams satisfy the condition implied by the set expression, and is 0 otherwise.
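The indicator-based estimate described above may be sketched as follows. The sketch assumes that the "generalized" Jaccard similarity is the ratio between the cardinality of the set expression and the cardinality of the union of all streams, and that the condition on the maximal hash values is "which streams attain the overall maximum"; both the interpretation and the names `generalized_jaccard` and `predicate` are assumptions made for this illustration.

```python
import hashlib

def unit_hash(element, seed):
    """Map (seed, element) to a float in the unit interval (illustrative hash choice)."""
    digest = hashlib.blake2b(f"{seed}:{element}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def max_hash_sketch(stream, k):
    """Vector of the maximal hash value of a non-empty stream for each of k hash functions."""
    return [max(unit_hash(e, j) for e in stream) for j in range(k)]

def generalized_jaccard(sketches, predicate):
    """Estimate |set expression| / |union of all streams| from max-hash sketches.

    `predicate` receives one boolean per stream (True when that stream attains the
    overall maximal hash value, i.e. contains the maximal element of the union) and
    returns True when the set expression contains that element.
    """
    k = len(sketches[0])
    hits = 0
    for j in range(k):
        top = max(s[j] for s in sketches)
        membership = [s[j] == top for s in sketches]
        if predicate(*membership):
            hits += 1
    return hits / k

if __name__ == "__main__":
    a = range(0, 300)
    b = range(100, 400)
    c = range(250, 450)
    sketches = [max_hash_sketch(s, 256) for s in (a, b, c)]
    # (A intersect B) minus C has 150 of the 450 union elements, i.e. a ratio of 1/3.
    print(generalized_jaccard(sketches, lambda in_a, in_b, in_c: in_a and in_b and not in_c))
```

Expressing the set expression as a boolean predicate keeps the estimator unchanged for unions, intersections, differences and their combinations; only the predicate differs.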
Algorithm 5 below may be used for estimating the cardinality of a set expression between the streams with sampling using the MTS sketch methodology. Algorithm 5 consists of three steps: (a) using Equation 11 to estimate the "generalized" Jaccard similarity; (b) using CAR EST PROC to estimate the cardinality of the sampled stream; and (c) using Procedure 5 to estimate the factor (ratio) by which the cardinality of the sampled stream should be multiplied in order to estimate the cardinality of the full stream.
Algorithm 5:
Analytical Analysis
The correctness of the MTS methodology, in particular, the correctness of Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4 and Algorithm 5, may be verified through an analytical analysis. In order to simplify the notations, a single notation is used to denote the estimated cardinality in each of the Algorithms.
Lemma 1 is presented to describe how to compute probability distribution of a product of two normally distributed random variables whose covariance is 0.
Lemma 1 (Product distribution):
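Assuming $X$ and $Y$ are two random variables satisfying $X \rightarrow N(\mu_X, \sigma_X^2)$, $Y \rightarrow N(\mu_Y, \sigma_Y^2)$ and $\mathrm{Cov}[X, Y] = 0$, then, as known in the art, the product $X \cdot Y$ asymptotically satisfies
$$X \cdot Y \rightarrow N\left(\mu_X \mu_Y,\; \mu_Y^2 \sigma_X^2 + \mu_X^2 \sigma_Y^2\right).$$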
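The product rule that Lemma 1 refers to may be checked numerically. The sketch below assumes the first-order (delta-method) form, in which the product of two uncorrelated, approximately normal variables has mean equal to the product of the means and variance equal to the sum of each variance weighted by the square of the other mean; the parameter values and function names are arbitrary choices made for this illustration.

```python
import random
import statistics

def product_moments(mu_x, sigma_x, mu_y, sigma_y):
    """First-order (delta-method) mean and variance of X*Y for uncorrelated X and Y."""
    mean = mu_x * mu_y
    variance = (mu_y ** 2) * sigma_x ** 2 + (mu_x ** 2) * sigma_y ** 2
    return mean, variance

def simulate_product(mu_x, sigma_x, mu_y, sigma_y, runs, seed=0):
    """Empirical mean and variance of X*Y over `runs` independent normal draws."""
    rng = random.Random(seed)
    products = [rng.gauss(mu_x, sigma_x) * rng.gauss(mu_y, sigma_y) for _ in range(runs)]
    return statistics.fmean(products), statistics.pvariance(products)

if __name__ == "__main__":
    print(product_moments(10, 0.1, 5, 0.05))          # analytical approximation
    print(simulate_product(10, 0.1, 5, 0.05, 50000))  # Monte Carlo counterpart
```

The approximation omits the term given by the product of the two variances, which is negligible when both standard deviations are small relative to the means, as is the case for the estimators analyzed here.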
For the analysis, the HyperLogLog algorithm as known in the art is used for the CAR EST PROC procedure in the MTS based Algorithms described hereinabove. The HyperLogLog estimator belongs to a family of sketches and may present improved cardinality estimation compared to other estimators known in the art. The standard error of the HyperLogLog estimator is approximately $1.04/\sqrt{m}$, where $m$ represents the number of storage units (e.g. registers) used for the estimation procedure. Pseudo-code of the HyperLogLog procedure is presented in Algorithm 6 below.
Algorithm 6:
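For orientation, a compact HyperLogLog sketch in Python is given here. It follows the commonly published form of the algorithm (register index from the leading hash bits, rank of the leftmost 1-bit in the remaining bits, harmonic-mean estimate, small-range correction) and is not a transcription of Algorithm 6; the 64-bit BLAKE2b hash and the parameter name `p` are assumptions.

```python
import hashlib
import math

def hyperloglog(stream, p=10):
    """Estimate the number of distinct elements using m = 2**p registers (p >= 7)."""
    m = 1 << p
    registers = [0] * m
    for element in stream:
        digest = hashlib.blake2b(str(element).encode(), digest_size=8).digest()
        x = int.from_bytes(digest, "big")        # 64-bit hash value
        index = x >> (64 - p)                    # leading p bits select the register
        rest = x & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > registers[index]:
            registers[index] = rank
    alpha = 0.7213 / (1 + 1.079 / m)             # bias-correction constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw

if __name__ == "__main__":
    stream = [i % 20000 for i in range(60000)]   # 20,000 distinct values, each repeated
    print(hyperloglog(stream))
```

With the default of 1,024 registers the expected standard error is about 3%, in line with the $1.04/\sqrt{m}$ figure.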
Lemma 2 below summarizes the statistical performance of Algorithm 6 without sampling, i.e., when the algorithm processes the entire stream.
Lemma 2:
the considered set, 13 is the estimated cardinality computed using Algorithm 6, and
is the number of storage units used by Algorithm 6.
Corollary 2:
Let and be two streams. When Algorithm 6 is used with 13 storage units and without sampling, the following applies:
As presented in the art, the asymptotic bias and variance of Algorithm 1 were analyzed when using the HyperLogLog algorithm as the CAR EST PROC. It was demonstrated that the sampling rate does not affect the asymptotic unbiasedness of the estimator. The effect of the sampling rate on the estimator's variance was further analyzed with respect to the storage sizes used by the estimator. The following theorem summarizes the statistical performance of Algorithm 1.
Theorem 1:
and is the frequency of element 0 in stream 0.
As described hereinabove, estimating the set union cardinality using Algorithm 2 is equivalent to estimating the cardinality of a single stream based on its sampled stream. Thus, the statistical performance of Algorithm 2 is equal to that of Algorithm 1.
Corollary 3:
where 0 and 0 are as stated in Theorem 1 with respect to the union stream
The following Lemmas, i.e. Lemma 3, Lemma 4 and Lemma 5 are used herein after for the analysis of the performance of Algorithm 3 and Algorithm 4 using the MTS sketch.
Lemma 3:
As proved in the art where
Lemma 4:
where s the cardinality of
Procedure 3 estimates
Denote the distinct elements in the union subsample as For each the probability that
belongs to may be expressed as follows:
It follows that is a sum of Bernoulli variables with success probability ¾.
Therefore, it is binomially distributed, and can be asymptotically approximated using normal distribution as
Lemma 5:
cardinality of
Lemma 5 may be proved as follows:
thus Equation 12 below follows from covariance properties.
The distinct elements in the union subsample may be denoted
As shown in Procedure 3, may be written as expressed in Equation 13 below.
where is an indicator variable that gets 1 and 0 otherwise.
otherwise.
Using covariance properties and Equation 13 the covariance may be expressed as shown in Equation 14 below.
Equation 14:
The first and third equalities are due to covariance properties. The second equality is due to the independence
The fourth equality is due to Lemma 4. It should be noted that follows in the same way as the proof of Lemma
4. The last equality is obtained through algebraic manipulations. The resulting expression follows by substituting Equation 14 into Equation 12.
Theorem 2 below states the asymptotic statistical performance of Algorithm 3 used for estimating the cardinality of the set intersection as described herein above.
Theorem 2:
where satisfies the following condition:
Theorem 2 may be proved as follows:
may be denoted
Similarly
may be denoted with the respective expression. Thus, the estimator in Algorithm 3 as expressed in Equation 7 may be rewritten as follows:
Good-Turing Equation 15 below follows.
The second equality follows by substituting and using Equation 15.
The variance may be expressed as
Lemma 4. This may result in Equation 16 below.
The asymptotic distribution of ¾ may be analyzed similarly.
The estimator
in Algorithm 3 is now analyzed. Note that
expectation may therefore be expressed as:
It follows that is an unbiased estimator for . The variance may be expressed by Equation 17 below.
Equation 17:
variance properties and the second equality follows from Equation 16 and Lemma 5.
proof.
Theorem 3 below states the asymptotic statistical performance of Algorithm 4 used for estimating the cardinality of the set difference as described hereinabove.
Theorem 3:
, where
where s as stated in Theorem 2.
Theorem 3 above states the asymptotic statistical performance of Algorithm 4 used for estimating the cardinality of the set difference as described hereinabove. Lemma 6 below is used for the analysis that follows.
Lemma 6:
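In Equation 1, the estimation of the Jaccard similarity $\rho$ is asymptotically normally distributed with mean $\rho$ and variance $\rho(1-\rho)/k$, where $k$ is the number of hash functions.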
The same may apply for the estimation of and according to Equation 3, with the change of to and respectively.
Lemma 6 may be proved for where the proof for and is similar. As known in the art, for the hash function the following applies:
P
The intuition is considering the hash function
and defining
for every sample 0, as the element in the sample 13 whose hash value for
is maximum
From Equation 1 and Equation 18 it follows that the estimator is a sum of Bernoulli variables, one per hash function. Therefore, it is binomially distributed, and can be asymptotically approximated by a normal distribution.
Algorithm 5, used for estimating the cardinality of set expressions, in particular a set union, a set intersection and a set difference between more than two streams as described hereinabove, is now analyzed. Theorem 4 below states the asymptotic statistical performance of Algorithm 5.
Theorem 4:
Theorem 4 may be proved as follows:
The estimator for the set expression between the 13 streams
recall that
According to Lemma 6, the term may be expressed as
According to Corollary 2, the term a may be expressed
Considering the product
then according to Lemma 1 and because the variables are independent, the following may be obtained:
Thus
Denoting
athe estimator is the final term in the estimator, i.e. is
now analyzed. According to Lemma Therefore
according to Lemma 1 and because the variables are independent, the following may be obtained:
Turing.
Simulation Tests
Simulation tests were conducted to validate the MTS methodology, in particular the Theorems presented herein above developed to analyze and prove the MTS Algorithms 1 , 2, 3, 4 and 5, in particular to validate the asymptotic bias and variance performance of the presented MTS algorithms. More specifically, the simulation was conducted to demonstrate the following:
Algorithms 3 and 4 are unbiased, as proven by Theorems 2 and 3.
The variance of Algorithms 3 and 4 is close to their analyzed variance in Theorems 2 and 3.
The variance of Algorithm 5 is close to its analyzed variance in Theorem 4.
The simulation tests were conducted with the MTS Algorithms implementing HyperLogLog as the CAR EST PROC procedure for estimating the cardinality.
The simulation tests for Algorithms 3 and 4 were conducted over two streams (sets), and , whose cardinalities are as follows:
Each distinct element appears a number of times, given by its frequency, in the original un-sampled streams $A$ and $B$. The frequencies are determined according to the following models known in the art:
Uniform distribution: The frequency of the elements is uniformly distributed between 100 and 1,000.
Pareto distribution: The frequency of the elements follows the heavy-tailed rule with shape parameter $\alpha$ and scale parameter $s$, i.e., the frequency probability function is $p(f) = \alpha s^{\alpha} f^{-(\alpha+1)}$ for $f \geq s$. The scale parameter $s$ represents the smallest possible frequency. The Pareto distribution has several unique properties. In particular, if $\alpha \leq 2$, the Pareto distribution has infinite variance, and if $\alpha \leq 1$, the Pareto distribution has infinite mean. As $\alpha$ decreases, a larger portion of the probability mass is in the tail of the distribution, and the Pareto distribution is therefore useful when a small percentage of the population controls the majority of the measured quantity.
Each of the simulation tests was repeated for 1,000 different streams (sets) $A$ and $B$. Thus, for each of the simulated MTS Algorithms and for each value of the varied parameter, a vector of 1,000 different estimations was produced. Then, for each value of the varied parameter, the variance and bias of this vector were computed, and the results as presented hereinafter are considered as the variance and bias of the respective Algorithm for that specific value. Each such computation is represented by one table row in Table 2, Table 3 and Table 4 below.
The bias and variance of the vector of estimations produced for a specific Algorithm and a specific value of the varied parameter are computed as follows:
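One plausible way of computing the two statistics for a vector of estimations is sketched below. The normalization is an assumption made here: each estimate is divided by the true cardinality, the bias is taken as the absolute deviation of the mean ratio from 1, and the variance is the population variance of the ratios. The function name is hypothetical.

```python
import statistics

def relative_bias_and_variance(estimates, true_cardinality):
    """Bias and variance of the normalized estimates n_hat / n (assumed definition)."""
    ratios = [e / true_cardinality for e in estimates]
    mean_ratio = statistics.fmean(ratios)
    bias = abs(mean_ratio - 1.0)
    variance = statistics.pvariance(ratios, mu=mean_ratio)
    return bias, variance

if __name__ == "__main__":
    # Four hypothetical estimates of a cardinality whose true value is 100.
    print(relative_bias_and_variance([90, 110, 100, 100], 100))
```

Normalizing by the true cardinality makes the figures comparable across rows with different stream sizes.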
First presented are the simulation tests results for the bias of Algorithm 3 applied for estimating cardinality of a set intersection and Algorithm 4 applied for estimating cardinality of a set difference as described herein before. Table 2 below presents the simulation tests results for the bias of Algorithm 3 (Alg. 3) and Algorithm 4 (Alg. 4) for different values of using uniformly distributed frequencies
storage units (buckets) and ) and Pareto distributed frequencies
The sampling ratio is In each table row we present the bias.
Table 2:
As evident from the results in Table 2, the measured bias values are very low and practically tend to 0, in agreement with the analytical analysis for the bias of Algorithms 3 and 4. For the uniform distribution, the number of distinct elements
Thus, the expected length of each original stream is
. A total storage budget of
storage units per stream, which is about 0.006% of the stream length, yields accurate estimation for both set intersection (Alg. 3) and set difference (Alg. 4) cardinalities. For the Pareto distribution, the expected length of each original stream is $500 \cdot 10^6$. Using a total storage budget of
storage units, namely, of the stream length, yields significantly accurate estimations for both set intersection and set difference cardinalities.
Now presented are the simulation tests results for the variance of Algorithms 3 and 4. Table 3 and Table 4 below present simulation tests results for both Algorithms 3 and 4 for different values of using uniform and Pareto frequency distributions. In both tables, buckets and
. The sampling ratio is
A and two values of are used,
. The results are averaged over 1 , 000 runs of the
simulation tests and the "analysis" variance is determined according to Theorems 2 and 3.
As can be seen in Table 3 and Table 4, the algorithm variance is always lower than 20% and in most cases lower than 10%, thus complying and in excellent agreement with the results expected by the analytical analysis.
Now presented are simulation test results for simulations of Algorithm 5 used for estimating the cardinality of set expression between 0 streams where
. The simulation tests aim to confirm Theorem 4 presented to analyze and theoretically verify Algorithm 5.
The simulation tests for Algorithms 5 were conducted over three streams (sets), , and , each with distinct elements and uniformly distributed frequencies as described herein above for the simulation of Algorithms 3 and 4. The simulation tests were conducted to estimate the cardinality of a set expression a a which, as known in the art, may be expressed
In the simulation test described hereinafter we fix the cardinality of a a and estimate for different values of the intersection using
Algorithm 5. The three streams , and have the following cardinalities:
The simulation tests are conducted to estimate the cardinality
for different values of the intersection
Table 5 below presents the simulation tests results for different intersection values using uniform frequency distributions for
buckets and
. The sampling ratio is
. The results are averaged over 1, 000 runs of the simulation tests, and the "analysis" variance is determined according to Theorem 4.
Table 5:
As evident from the test results in Table 5, the relative error of the variance of Algorithm 5 as measured in the simulation tests is approximately 5%, i.e. the measured variance is very similar to the variance expected by the analytical analysis. As expected, when the cardinality increases (hence the estimated cardinality increases as well), the variance decreases.
The MTS methodology may be applied to a plurality of applications in a wide variety of domains.
For example, the MTS methodology may be used for query optimization, which may be required by database systems to determine a best (low-cost) plan for processing queries. The query optimization may be processed by a query optimizer which estimates the cost of a plan according to the input/output cardinalities of each of the plan's operators. Accurate cardinality estimation of set expressions over table fields in one scan using fixed memory, as done by the MTS based cardinality estimation for the set expressions, may be significantly valuable for such query optimizers. For example, assuming three large relational databases,
with a shared field
In case of processing a query such as
, where 13 is a stream of tuples in the field
The database system may need to determine the best (low-cost) plan for processing this query. Using the MTS sketch, the query optimizer may efficiently estimate the cardinality of the set expression
In another example, in the case of a join query, by estimating the cardinality (number of distinct tuples) of each involved table, the database system may select the best join strategy. Moreover, by estimating the cardinality of a set intersection between all involved tables, the system may estimate the size of the outcome join.
In another example, the MTS methodology may be used for network monitoring and security. Network management may require continuous measurement of multiple network parameters whose values may be efficiently estimated using the MTS sketch, using only a small portion of the monitored data. Moreover, by processing only a small portion of the traffic, real-time detection of anomalies may be feasible. For example, assuming a network where IP packets are received from different
through router . Two examples of monitoring applications may be as follows:
(a) Using the MTS sketch, the total number of IP packets that enter the network can be efficiently estimated by estimating the cardinality of the set union
(b) Assuming all incoming IP packets must pass through a firewall. This may be verified by verifying that the set
is empty, where 0 is the set of IP flows that enter the firewall. This verification may be efficiently done using the MTS sketch by estimating the cardinality
a and verifying that this cardinality tends to
0.
It is expected that during the life of a patent maturing from this application many relevant systems, methods and computer programs will be developed and the scope of the terms cardinality estimation procedure and sampling technique are intended to include all such new technologies a priori.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to".
The term "consisting of" means "including and limited to".
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicate number and a second indicate number and "ranging/ranges from" a first indicate number "to" a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Claims
1. A computer implemented method of estimating a cardinality of a stream, comprising:
using at least one processor configured to execute a code, the code is adapted for:
receiving a query for estimating a cardinality of a stream comprising a plurality of elements;
obtaining a sample comprising a group of the plurality of elements randomly sampled from the respective stream;
computing a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample;
computing, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream; and
computing the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
2. The computer implemented method of claim 1, wherein each of the plurality of elements includes at least one member of a group consisting of: a tuple, a word, a symbol, a binary representation, a numeral expression and an internet protocol (IP) packet.
3. A computer implemented method of estimating a cardinality of set expressions between streams, comprising:
using at least one processor configured to execute a code, the code is adapted for:
receiving a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements;
obtaining a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream;
computing a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for the each sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the each sample;
computing, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression; and
computing the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
4. The computer implemented method of claim 3, wherein the combination function is a union function to create a set union between the plurality of streams, the first data structure comprising the plurality of maximal hash values computed for a concatenation of the plurality of samples, the second data structure is created by selecting the fixed-size subset from the concatenation of the plurality of samples.
5. The computer implemented method of claim 3, wherein the combination function is an intersection function to create a set intersection between the plurality of streams, the sample cardinality is created for a set intersection between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
6. The computer implemented method of claim 3, wherein the combination function is a difference function to create a set difference between the plurality of streams, the sample cardinality is created for a set difference between the second data structure of the plurality of samples, the ratio value is computed using a Jaccard similarity value computed for the plurality of samples and applied to an estimated cardinality computed for a set union between the second data structure of the plurality of samples.
7. A system for estimating a cardinality of a stream, comprising:
at least one processor adapted to execute code, the code comprising:
code instructions to receive a query for estimating a cardinality of a stream comprising a plurality of elements;
code instructions to obtain a sample comprising a group of the plurality of elements randomly sampled from the respective stream;
code instructions to compute a first data structure and a second data structure for the sample, the first data structure comprising a plurality of maximal hash values each computed for the sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of the sample;
code instructions to compute, using the first data structure and the second data structure, an estimated sample cardinality value of the sample and a ratio value indicative of a proportion between the estimated sample cardinality value and the estimated cardinality value of the stream; and
code instructions to compute the estimated cardinality value of the stream by applying the ratio value to the estimated sample cardinality value.
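For illustration, the single-stream flow recited above may be sketched as follows. The names `hash01`, `build_sketch`, `estimate_sample_cardinality`, and `scale_to_stream` are hypothetical. The sample cardinality estimator shown is one standard estimator based on the maxima of uniform hash values and is not asserted to be the estimator of any particular embodiment. How the ratio value is derived from the two data structures is not shown here; the sketch simply assumes the ratio is supplied and is defined as the sample cardinality divided by the stream cardinality.

```python
import hashlib
import math

def hash01(element, seed):
    """Hypothetical hash family: map (seed, element) to a float between 0 and 1."""
    digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def build_sketch(sample, m, k):
    """Return (first data structure, second data structure) for a sample."""
    distinct = set(sample)  # repetitions do not change either structure
    max_values = [max(hash01(x, j) for x in distinct) for j in range(m)]
    subset = sorted(distinct, key=lambda x: hash01(x, "subset"))[:k]
    return max_values, subset

def estimate_sample_cardinality(max_values):
    """Estimate the number of distinct elements n from m hash maxima.

    The maximum of n uniform values has CDF x**n, so -ln(max) is exponential
    with rate n, and (m - 1) / sum(-ln(max_j)) is an unbiased estimate of n.
    """
    m = len(max_values)
    total = sum(-math.log(v) for v in max_values)
    return (m - 1) / total

def scale_to_stream(sample_cardinality, ratio):
    """Apply the ratio value (assumed: sample cardinality / stream cardinality)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return sample_cardinality / ratio

# Assumed example data: 5000 sampled elements, 1000 of them distinct.
sample = [i % 1000 for i in range(5000)]
max_values, subset = build_sketch(sample, m=128, k=64)
sample_estimate = estimate_sample_cardinality(max_values)
# The ratio 0.25 is a made-up input for illustration only.
stream_estimate = scale_to_stream(sample_estimate, ratio=0.25)
print(round(sample_estimate), round(stream_estimate))
```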
8. A system for estimating a cardinality of set expressions between streams, comprising:
at least one processor adapted to execute code, the code comprising:
code instructions to receive a query for estimating a cardinality of a stream set expression created by applying a combination function to a plurality of streams each comprising at least some of a plurality of elements;
code instructions to obtain a plurality of samples each associated with a respective one of the plurality of streams and comprising a group of the plurality of elements randomly sampled from the respective stream;
code instructions to compute a first data structure and a second data structure for each of the plurality of samples, the first data structure comprising a plurality of maximal hash values each computed for each sample using a respective one of a plurality of hash functions, the second data structure comprising a fixed-size subset of the elements having a minimal hash value among the elements of the group of each sample;
code instructions to compute, using the first data structure and the second data structure of the plurality of samples, an estimated sample cardinality value of a sample set expression created by applying the combination function to the plurality of samples and a ratio value indicative of a proportion between the sample cardinality value and an estimated cardinality value of the stream set expression; and
code instructions to compute the estimated cardinality value of the stream set expression by applying the ratio value to the estimated sample cardinality value.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662406019P | 2016-10-10 | 2016-10-10 | |
US62/406,019 | 2016-10-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018069928A1 (en) | 2018-04-19 |
Family
ID=61905256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IL2017/051134 WO2018069928A1 (en) | 2016-10-10 | 2017-10-10 | Mts sketch for accurate estimation of set-expression cardinalities from small samples |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018069928A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150100596A1 (en) * | 2013-10-06 | 2015-04-09 | Yahoo! Inc. | System and method for performing set operations with defined sketch accuracy distribution |
- 2017-10-10: WO application PCT/IL2017/051134 (WO2018069928A1), status active, Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11711310B2 (en) | 2019-09-18 | 2023-07-25 | Tweenznet Ltd. | System and method for determining a network performance property in at least one network |
US11716338B2 (en) | 2019-11-26 | 2023-08-01 | Tweenznet Ltd. | System and method for determining a file-access pattern and detecting ransomware attacks in at least one computer network |
CN114218292A (en) * | 2021-11-08 | 2022-03-22 | 中国人民解放军国防科技大学 | Multi-element time sequence similarity retrieval method |
CN114218292B (en) * | 2021-11-08 | 2022-10-11 | 中国人民解放军国防科技大学 | Multi-element time sequence similarity retrieval method |
CN117792962A (en) * | 2024-02-28 | 2024-03-29 | 苏州大学 | Distributed stream base measuring method, device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 17861119; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 17861119; Country of ref document: EP; Kind code of ref document: A1 |