US8392381B2 - Systems and methods for reservoir sampling of streaming data and stream joins - Google Patents
Systems and methods for reservoir sampling of streaming data and stream joins
- Publication number
- US8392381B2 (application US12/599,163; US59916308A)
- Authority
- US
- United States
- Prior art keywords
- sampling
- size
- reservoir
- tuples
- computer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01D—MEASURING NOT SPECIALLY ADAPTED FOR A SPECIFIC VARIABLE; ARRANGEMENTS FOR MEASURING TWO OR MORE VARIABLES NOT COVERED IN A SINGLE OTHER SUBCLASS; TARIFF METERING APPARATUS; MEASURING OR TESTING NOT OTHERWISE PROVIDED FOR
- G01D21/00—Measuring or testing not otherwise provided for
Definitions
- the present invention generally relates to the field of statistical sampling.
- the present invention is directed to systems and methods for reservoir sampling of streaming data and stream joins.
- Uniform random sampling has been known for its usefulness and efficiency for generating consistent and unbiased estimates of an underlying population. In this sampling scheme, every possible sample of a given size has the same chance to be selected. Uniform random sampling has been heavily used in a wide range of application domains such as statistical analysis, computational geometry, graph optimization, knowledge discovery, approximate query processing, and data stream processing.
- When data subject to sampling come in the form of a data stream (e.g., stock price analysis and sensor network monitoring), sampling encounters two major challenges. First, the size of the stream is usually unknown a priori and, therefore, it is not possible to predetermine the sampling fraction (i.e., sampling probability) by the time sampling starts. Second, in most cases the data arriving in a stream cannot be stored and, therefore, have to be processed sequentially in a single pass. A technique commonly used in this scenario is reservoir sampling, which selects a uniform random sample of a given size from an input stream of an unknown size. Reservoir sampling has been used in many database applications including clustering, data warehousing, spatial data management, and approximate query processing.
- Conventional reservoir sampling selects a uniform random sample of a fixed size, without replacement, from an input stream of an unknown size (see Algorithm 1, below). Initially, the algorithm places all tuples in the reservoir until the reservoir (of size r tuples) becomes full. After that, each k-th tuple is sampled with the probability r/k. A sampled tuple replaces a randomly selected tuple in the reservoir. It is easy to verify that the reservoir always holds a uniform random sample of the k tuples seen so far.
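For concreteness, Algorithm 1 translates into a few lines of Python. The sketch below is ours (the function and variable names are not from the patent) and maintains a uniform random sample of r tuples in a single pass:

import random

def reservoir_sample(stream, r):
    # Algorithm 1: keep a uniform random sample of size r from a stream
    # of unknown length, in one pass.
    reservoir = []
    for k, item in enumerate(stream, start=1):
        if k <= r:
            reservoir.append(item)                 # fill the reservoir first
        elif random.random() < r / k:              # sample the k-th tuple w.p. r/k
            reservoir[random.randrange(r)] = item  # replace a random victim
    return reservoir

# Example: a 10-tuple uniform sample from a million-element stream.
print(reservoir_sample(iter(range(1_000_000)), 10))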
- Conventional reservoir sampling assumes a fixed size reservoir (i.e., the size of a sample is fixed).
- Data collection applications have been extensively addressed in research literature.
- a mobile sink traverses the network and collects data from sensors.
- each sensor needs to retain a uniform random sample of the join output, instead of streaming out the sample tuples toward the sink.
- a natural solution to keep a uniform random sample of the join output stream is to use reservoir sampling.
- keeping a reservoir sample over stream joins is not trivial since streaming applications can be limited in memory size.
- the present disclosure is directed to a method of maintaining a uniform random sample by a machine.
- the method includes: establishing in a machine memory a sampling reservoir having a size; receiving a data stream containing sequentially arriving tuples; sampling the data stream so as to store ones of the sequentially arriving tuples in the sampling reservoir so as to create stored tuples; while sampling, adjusting the size of the sampling reservoir in a controlled manner; and after adjusting the size, continuing sampling the data stream and storing ones of the sequentially arriving tuples in the sampling reservoir so as to maintain a sample of the data stream with a certain uniformity confidence.
- the machine memory has a limited size and the method further includes: establishing in the machine memory a plurality of sampling reservoirs each having a size; receiving a plurality of data streams each containing a plurality of sequentially arriving tuples, the plurality of data streams corresponding respectively to the plurality of sampling reservoirs; checking whether the size of any one or more of the plurality of sampling reservoirs should be changed; and for each of the plurality of sampling reservoirs for which the size should be changed, adjusting the size of that one of the plurality of sampling reservoirs as a function of the limited size of the machine memory.
- the present disclosure is directed to a method of performing join sampling by a machine.
- the method includes: establishing in a machine memory a sampling reservoir having a sampling reservoir size, and a join buffer having a join buffer size; simultaneously receiving a plurality of data streams; join-sampling the plurality of data streams so as to create a plurality of join-sample tuples; storing the plurality of join-sample tuples in the join buffer; reservoir sampling the plurality of join-sample tuples so as to create a plurality of reservoir sample tuples; and storing the plurality of reservoir sample tuples in the sampling reservoir.
- the present disclosure is directed to a computer-readable medium containing computer-executable instructions for performing a method of maintaining a uniform random sample.
- the computer-executable instructions include: a first set of computer-executable instructions for receiving a data stream containing sequentially arriving tuples; a second set of computer-executable instructions for sampling the data stream so as to store ones of the sequentially arriving tuples in a sampling reservoir so as to create stored tuples; a third set of computer-executable instructions for adjusting the size of the sampling reservoir in a controlled manner while sampling; and a fourth set of computer-executable instructions for continuing sampling the data stream after the adjusting of the size and storing ones of the sequentially arriving tuples in the sampling reservoir so as to maintain a sample of the data stream with a certain uniformity confidence.
- the machine memory has a limited size and the computer-executable instructions further include: computer-executable instructions for establishing in the machine memory a plurality of sampling reservoirs each having a size; computer-executable instructions for receiving a plurality of data streams each containing a plurality of sequentially arriving tuples, the plurality of data streams corresponding respectively to the plurality of sampling reservoirs; computer-executable instructions for checking whether the size of any one or more of the plurality of sampling reservoirs should be changed; and computer-executable instructions that, for each of the plurality of sampling reservoirs for which the size should be changed, adjust the size of that one of the plurality of sampling reservoirs as a function of the limited size of the machine memory.
- the present disclosure is directed to a computer-readable medium containing computer-executable instructions for performing a method of maintaining a uniform random sample.
- the computer-executable instructions include: a first set of computer-executable instructions for establishing in a machine memory a sampling reservoir, having a sampling reservoir size, and a join buffer having a join buffer size; a second set of computer-executable instructions for simultaneously receiving a plurality of data streams; a third set of computer-executable instructions for join-sampling the plurality of data streams so as to create a plurality of join-sample tuples; a fourth set of computer-executable instructions for storing the plurality of join-sample tuples in the join buffer; a fifth set of computer-executable instructions for reservoir sampling the plurality of join-sample tuples so as to create a plurality of reservoir sample tuples; and a sixth set of computer-executable instructions for storing the plurality of reservoir sample tuples in the sampling reservoir.
- the present disclosure is directed to a system that includes: at least one processor for processing computer-executable instructions; and memory functionally connected to the at least one processor, the memory containing computer-executable instructions for performing a method of maintaining a uniform random sample.
- the computer executable instructions include: a first set of computer-executable instructions for receiving a data stream containing sequentially arriving tuples; a second set of computer-executable instructions for sampling the data stream so as to store ones of the sequentially arriving tuples in a sampling reservoir so as to create stored tuples; a third set of computer-executable instructions for adjusting the size of the sampling reservoir in a controlled manner while sampling; and a fourth set of computer-executable instructions for continuing sampling the data stream after the adjusting of the size and storing ones of the sequentially arriving tuples in the sampling reservoir so as to maintain a sample of the data stream with a certain uniformity confidence.
- the memory includes a portion having a limited size and further contains: computer-executable instructions for establishing in the portion of the memory a plurality of sampling reservoirs each having a size; computer-executable instructions for receiving a plurality of data streams each containing a plurality of sequentially arriving tuples, the plurality of data streams corresponding respectively to the plurality of sampling reservoirs; computer-executable instructions for checking whether the size of any one or more of the plurality of sampling reservoirs should be changed; and computer-executable instructions that, for each of the plurality of sampling reservoirs for which the size should be changed, adjust the size of that one of the plurality of sampling reservoirs as a function of the limited size of the portion of the memory.
- the present disclosure is directed to a system that includes: at least one processor for processing computer-executable instructions; and memory functionally connected to the at least one processor, the memory containing computer-executable instructions for performing a method of maintaining a uniform random sample.
- the computer executable instructions include: a first set of computer-executable instructions for establishing in a machine memory a sampling reservoir, having a sampling reservoir size, and a join buffer having a join buffer size; a second set of computer-executable instructions for simultaneously receiving a plurality of data streams; a third set of computer-executable instructions for join-sampling the plurality of data streams so as to create a plurality of join-sample tuples; a fourth set of computer-executable instructions for storing the plurality of join-sample tuples in the join buffer; a fifth set of computer-executable instructions for reservoir sampling the plurality of join-sample tuples so as to create a plurality of reservoir sample tuples; and a sixth set of computer-executable instructions for storing the plurality of reservoir sample tuples in the sampling reservoir.
- FIG. 1 is a high-level diagram of a wireless sensor network implementing systems and methods of the present disclosure
- FIG. 2 is a diagram illustrating a decrease in the size of a sampling reservoir during sampling in the context of adaptive reservoir sampling performed in accordance with the present disclosure
- FIG. 3 is a diagram illustrating an increase in the size of a sampling reservoir during sampling in the context of adaptive reservoir sampling performed in accordance with the present disclosure
- FIG. 4 is a graph of uniformity confidence with respect to the uniformity confidence recovery tuple count, m, for a sampling reservoir of an increasing size
- FIG. 5 is a magnified portion of the graph of FIG. 4 for m greater than or equal to 9000;
- FIG. 6 is a graph of the total number of readings versus mote identifier for a plurality of motes used in an experiment to evaluate performance of an adaptive multi-reservoir sampling algorithm of the present disclosure
- FIG. 8 is a graph of reservoir size versus time for selected reservoirs (mote IDs 2, 15, 31, 49 and 54) used in the experiment corresponding to FIG. 6;
- FIG. 11 is a high-level diagram illustrating a join-sampling processing model
- FIG. 12 is a graph of sample size versus time for a sample size experiment performed on reservoir join-sampling (RJS) and progressive reservoir join-sampling (PRJS) algorithms of the present disclosure, wherein reservoir size increase time is marked with a diamond and sample-use time is marked with a circle;
- FIG. 13 is a magnified graph of sample size versus time for the sample size experiment performed on the PRJS algorithm
- FIG. 14 is a graph of reservoir size versus the number, l, of tuples that would be generated without join-sampling by the time the reservoir sample will be used for the sample size experiment performed on the PRJS algorithm, showing the effect of l on the reservoir size;
- FIG. 15 is a graph of sample uniformity versus time for both of the RJS and PRJS algorithms during the sample size experiment, wherein reservoir size increase time is marked with a diamond and sample-use time is marked with a circle;
- FIG. 16 is a graph of sample uniformity versus time for the PRJS algorithm during experiments using partially sorted streams
- FIG. 17 is a graph of average absolute aggregation error versus time for both of the RJS and PRJS algorithms for a set of experiments directed to comparing RJS and PRJS in terms of the accuracy of aggregation;
- FIG. 18 is a high-level schematic diagram illustrating a computing device representative of computing devices that could be used for implementing any one or more of the algorithms of the present disclosure.
- the present invention relates to the development by the present inventors of novel algorithms for reservoir sampling of data streams and stream joins, as well as to systems and methods that implement these algorithms.
- the novel algorithms include adaptive reservoir sampling, for single and multiple reservoirs, in which the size of the reservoir(s) at issue is/are dynamically increased and/or decreased in size during the sampling.
- the novel algorithms also include fixed and progressive (increasing in size) reservoir join sampling.
- Uniformity confidence is the probability that a sampling algorithm generates a uniform random sample.
- a sample is a uniform random sample if it is produced using a sampling scheme in which all statistically possible samples of the same size are equally likely to be selected. In this case, we say the uniformity confidence in the sampling algorithm equals 100%. In contrast, if some statistically possible samples cannot be selected using a certain sampling algorithm, then we say the uniformity confidence in the sampling algorithm is below 100%.
- uniformity confidence as follows:
- the uniformity confidence in a reservoir sampling algorithm which produces a sample S of size r is defined as the probability that S[r] is a uniform random sample of all the tuples seen so far. That is, if k tuples have been seen so far, then the uniformity confidence is 100% if and only if every statistically possible S[r] has an equal probability to be selected from the k tuples.
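This definition can be checked empirically: draw many samples and verify that every one of the C(k, r) statistically possible samples occurs with roughly equal frequency. A small illustrative sketch (ours, not from the patent), using Python's uniform random.sample as the 100%-confidence baseline:

import itertools
import random
from collections import Counter

def uniformity_ratios(population, r, runs=60_000):
    # Frequency of each size-r sample relative to the uniform expectation.
    # A scheme has 100% uniformity confidence only if every possible subset
    # can occur, each with the same probability.
    counts = Counter(frozenset(random.sample(population, r)) for _ in range(runs))
    n_possible = sum(1 for _ in itertools.combinations(population, r))
    expected = runs / n_possible
    return {tuple(sorted(s)): round(c / expected, 2) for s, c in counts.items()}

# For a 6-element population and r = 2, all 15 ratios should hover near 1.0.
print(uniformity_ratios(range(6), 2))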
- FIG. 1 shows a wireless sensor network 100 that includes a plurality of wireless sensors 104 distributed in a plurality of spatial clusters, here three clusters 108 A-C.
- each cluster 108 A-C has an associated proxy 112 A-C that includes a memory 116 that stores sensor readings from the sensors of that cluster and acts as a data cache.
- a mobile sink 120 navigates network 100 to periodically collect data from proxies 112 A-C. Memory 116 of each proxy 112 A-C, however, is limited and, therefore, may store only samples of the readings.
- Each proxy 112 A-C may very well keep multiple reservoir samples 124 , one for each sensor 104 in the corresponding cluster 108 A-C.
- a software application for example, a monitoring and analysis application 128 aboard a computer 132 , such as a general purpose computer (e.g., laptop, desktop, server, etc.), may demand that the size of a reservoir be in proportion to the number of readings generated so far by the corresponding sensor. If, for example, the sampling rates of sensors 104 change over time, the reservoir sizes should be adjusted dynamically as the sampling rates of the sensors change.
- each sensor 104 may include one or more of any type of transducer or other measurement device suitable for sensing the desired state(s) at the location of that sensor.
- device types include, but are not limited to, temperature transducers, motion transducers (e.g. accelerometers), displacement transducers, flow transducers, speed transducers, pressure transducers, moisture transducers, chemical sensing transducers, photo transducers, voltage sensors, electrical current sensors, electrical power sensors, and radiation transducers, among many others.
- Wireless sensors, proxies, mobile sinks and computers suitable for use, respectively, as wireless sensors 104 , proxies 112 A-C, mobile sink 120 and computer 132 , are well known in the art and therefore need not be described in any detail herein for those skilled in the art to implement concepts of the present invention to their fullest scope. The same holds true for actual software applications that correspond to software application 128 .
- periodic queries are appropriate for many real-time streaming applications, such as security surveillance and health monitoring.
- query instances are instantiated periodically by the system. Upon instantiation, a query instance takes a snapshot of tuples that arrived since the last instantiation of the query.
- a technique like random sampling is used to reduce the stream rate.
- the system may keep a reservoir sample of stream data arriving between the execution times of two consecutive query instances.
- the system should provide a way to increase the reservoir size for better representing the stream data at the execution time of the next query instance.
- each query may have its own reservoir sample maintained.
- one or more queries can be registered to or expired from the system.
- the system should be able to adaptively reallocate the memory among all reservoirs of the current query set.
- the following section presents a novel algorithm (called “adaptive reservoir sampling”) for maintaining a reservoir sample for a single reservoir after the reservoir size is adjusted. If the size decreases, the algorithm maintains a sample in the reduced reservoir with a 100% uniformity confidence by randomly evicting tuples from the original reservoir. If the size increases, the algorithm finds the minimum number of incoming tuples that should be considered in the input stream to refill the enlarged reservoir such that the resulting uniformity confidence exceeds a given threshold. Then, the algorithm decides probabilistically on the number of tuples to retain in the enlarged reservoir and randomly evicts the remaining number of tuples. Eventually, the algorithm fills the available room in the enlarged reservoir using the incoming tuples.
Algorithm 2: Adaptive Reservoir Sampling
Inputs: r {reservoir size}
        k {number of tuples seen so far}
        ζ {uniformity confidence threshold}
1: while true do
2:   while reservoir size does not change do
3:     conventional reservoir sampling (Algorithm 1, Background section, above).
4:   end while
5:   if reservoir size is decreased by δ then
6:     randomly evict δ tuples from the reservoir.
7:   else
8:     {i.e., reservoir size is increased by δ}
9:     Find the minimum value of m (using Equation {3}, below, with the current values of k, r, δ) that causes the UC to exceed ζ.
10:    flip a biased coin to decide on the number, x, of tuples to retain among r tuples already in the reservoir (Equation {4}, below).
11:    randomly evict r − x tuples from the reservoir.
12:    select r+δ−x tuples from the incoming m tuples using conventional reservoir sampling (Algorithm 1, Background section, above).
13:  end if
14: end while
- the size of a reservoir is increased from r to r+δ (δ>0) immediately after the k-th tuple arrives (see FIG. 3). Then, the reservoir has room for δ additional tuples. Clearly, there is no way to fill this room from sampling the k tuples as they have already passed by. Only incoming tuples can be used to fill the room. The number of incoming tuples used to fill the enlarged reservoir is denoted as m and is called the "uniformity confidence recovery tuple count."
- r existing tuples are allowed to be evicted probabilistically and replaced by some of the incoming m tuples.
- the number of tuples evicted (or equivalently, the number of tuples retained) are randomly picked.
- the number of tuples that are retained, x, can be no more than r.
- x should not be less than (r+ ⁇ ) ⁇ m if m ⁇ r+ ⁇ (because otherwise, with all m incoming tuples the enlarged reservoir cannot be refilled), and no less than 0 otherwise.
- FIG. 4 shows this pattern for one setting of k, r, and ⁇ . Note that the uniformity confidence never reaches 100%, as exemplified by FIG. 5 which magnifies the uniformity confidence curve of FIG. 4 for m ⁇ 9000.
- the exemplary adaptive reservoir sampling algorithm works as shown in Algorithm 2, above. As long as the size of the reservoir does not change, it uses conventional reservoir sampling to sample the input stream (Line 3). If the reservoir size decreases by ⁇ , the algorithm evicts ⁇ tuples from the reservoir randomly (Line 6). After that, the algorithm continues sampling using the conventional reservoir sampling (Line 3). On the other hand, if the reservoir size increases by ⁇ , the algorithm computes the minimum value of m (using Equation ⁇ 3 ⁇ ) that causes the uniformity confidence to exceed a given threshold ( ⁇ ) (Line 9).
- the algorithm flips a biased coin to decide on the number of tuples (x) to retain among the r tuples already in the reservoir (Line 10).
- the probability of choosing the value x, where max{0, (r+δ)−m} ≦ x ≦ r, is defined by Equation {4} (see the sketch following this walkthrough).
- the algorithm randomly evicts r ⁇ x tuples from the reservoir (Line 11) and refills the remaining reservoir space with r+ ⁇ x tuples from the arriving m tuples using conventional reservoir sampling (Line 12).
- the algorithm continues sampling the input stream using the conventional reservoir sampling (Line 3) as if the sample in the enlarged reservoir were a uniform random sample of the k+m tuples.
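The images of Equations {3} and {4} did not survive extraction in this copy; the Python helpers below are therefore a reconstruction from the counting argument given in the Description (achievable samples over statistically possible samples) and should not be read as the patent's verbatim formulas:

import math
import random

def _c(n, j):
    # Binomial coefficient, with out-of-range arguments treated as 0.
    return math.comb(n, j) if 0 <= j <= n else 0

def uc(k, r, delta, m):
    # Reconstructed Equation {3}: uniformity confidence after the reservoir
    # grows from r to r + delta (k tuples seen so far) and is refilled from
    # the next m arrivals: achievable samples / statistically possible ones.
    num = sum(_c(k, x) * _c(m, r + delta - x)
              for x in range(max(0, r + delta - m), r + 1))
    return num / _c(k + m, r + delta)

def min_recovery_count(k, r, delta, zeta, m_cap=10**7):
    # Line 9 of Algorithm 2: (approximately) the smallest m whose UC exceeds
    # zeta. UC approaches but never reaches 100% as m grows (FIG. 5), so the
    # search terminates for any zeta < 1; m_cap is only a safety guard.
    m = delta
    while m < m_cap and uc(k, r, delta, m) <= zeta:
        m += max(1, m // 10)  # coarse geometric steps suffice for a sketch
    return m

def choose_retained(k, r, delta, m):
    # Reconstructed Equation {4}: the biased coin of Line 10. Each feasible x
    # (number of old tuples retained) is weighted by the number of achievable
    # samples containing exactly x of the k tuples seen so far.
    xs = list(range(max(0, r + delta - m), r + 1))
    weights = [_c(k, x) * _c(m, r + delta - x) for x in xs]
    return random.choices(xs, weights=weights)[0]

# Example: after k = 1000 tuples with r = 100 and delta = 20, find roughly
# how many incoming tuples restore a 0.9 uniformity confidence.
print(min_recovery_count(1000, 100, 20, 0.9))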
- FIG. 1 is illustrative of a wireless sensor network that could be used for any of these applications. Those skilled in the art will readily understand the modifications needed to implement the present invention in a wired or hybrid type sensor network.
- an adaptive multi-reservoir sampling algorithm is based on the following key ideas.
- an objective of the algorithm is to adaptively adjust the memory allocation in each proxy so that the size of each reservoir is allocated in proportion to the number of readings (i.e., tuples) generated so far by the corresponding sensor. More specifically, this objective is to allocate the memory of size M to the reservoirs (R1, R2, ..., Rv) of v input streams (S1, S2, ..., Sv).
- the algorithm adjusts the memory allocation only if the relative change in the size of at least one reservoir is above a given memory adjustment threshold and the resulting uniformity confidence for all reservoirs exceeds a given uniformity confidence threshold.
- Confidence interval is the range in which the true value of the population is estimated to be.
- Confidence level is the probability value associated with a confidence interval.
- Degree of variability in the population is the degree in which the attributes being measured are distributed throughout the population. A more heterogeneous population requires a larger sample to achieve a given confidence interval. Based on these criteria, the following simplified formula for calculating a statistically appropriate sample size is provided, assuming 95% confidence level and 50% degree of variability (note that 50% indicates the maximum variability in a population):
n = N / (1 + N · e²)   {5}

where n is the sample size, N is the population size, and e is 1 − confidence interval.
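A worked example of Equation {5} in Python (the function name is ours):

def required_sample_size(N, e):
    # Equation {5}: n = N / (1 + N * e^2), the simplified sample size at a
    # 95% confidence level and 50% degree of variability.
    return N / (1 + N * e * e)

# With e = 0.05, about 385 of 10,000 tuples -- and still only about 400 of
# 1,000,000 -- are needed.
print(required_sample_size(10_000, 0.05), required_sample_size(1_000_000, 0.05))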
- ri(t) = ki(t) / (1 + ki(t) · e²)   {6}, subject to the following limit on the total memory M: the v reservoir sizes may sum to at most M.
- this computed reservoir size riM(t) may be different from the reservoir size ri(tu) adjusted at time point tu (tu < t); δi(t) denotes the difference.
- the uniformity confidence is 100%.
- the uniformity confidence is below 100%; in this case, as in Algorithm 2, the sample is maintained in an enlarged reservoir R i using incoming tuples from the input stream.
- the uniformity confidence expressed in Equation {3} is refined, for an enlarged reservoir Ri, by substituting ki(t), ri(tu), δi(t), and mi(t) for k, r, δ, and m (see Equation {10}, below).
- the memory allocation is adjusted only if the relative change in the computed size exceeds a given threshold φ for some Ri; that is, the adjustment is considered if Equation {11} holds for some i ∈ {1, 2, ..., v}.
- an exemplary adaptive multi-reservoir algorithm is as follows:
Algorithm 3: Adaptive Multi-Reservoir Sampling
Inputs: ζ, φ, M, T, {r1(tu), r2(tu), ..., rv(tu)}, {k1(t), k2(t), ..., kv(t)}, {λ1(t), λ2(t), ..., λv(t)}
1: while true do
2:   while there are no tuples arriving from any stream do
3:     {do nothing.}
4:   end while
     {one or more tuples arrived from some streams}
5:   compute ri(t) (Equation {6}) for the streams from which tuples arrived.
6:   for each Ri ∈ {R1, R2, ..., Rv} do
7:     compute riM(t) (Equation {8}).
8:     compute δi(t) = riM(t) − ri(tu).
9:   end for
10:  if Equation {11} holds for any Ri ∈ {R1, R2, ..., Rv} then
11:    Lreduced = set of all Ri whose δi(t) < 0
12:    Lenlarged = set of all Ri whose δi(t) > 0
13:    compute mi(t) (Equation {9}) for all Ri ∈ Lenlarged.
14:    L′enlarged = set of all Ri ∈ Lenlarged whose UCi(ki(t), ri(tu), δi(t), mi(t)) ≦ ζ
15:    if L′enlarged is empty then
16:      for each Ri ∈ (Lreduced ∪ Lenlarged) do
17:        if Ri ∈ Lreduced then
18:          randomly evict |δi(t)| tuples from Ri.
19:        else
20:          flip a biased coin to decide on the number of tuples, x, to retain in Ri (using Equation {4} with ki(t), ri(tu), δi(t), mi(t) substituting k, r, δ, m, respectively).
21:          randomly evict ri(tu) − x tuples from Ri.
22:          select ri(tu) + δi(t) − x tuples from the incoming mi(t) tuples using Algorithm 1 (Background section, above).
23:        end if
24:        ri(tu) = riM(t)
25:      end for
26:    end if
27:  end if
28: end while
- Algorithm 3 works as follows. As long as there are no tuples arriving from any stream, the algorithm stays idle (Lines 2-4). Upon the arrival of a new tuple from any stream, it computes ri(t) for those streams from which tuples arrived (Line 5) and computes riM(t) and δi(t) for all streams (Lines 6-9). Then, it checks if the relative change in the size of any reservoir is larger than the memory adjustment threshold φ (using Equation {11}) (Line 10). If so, it computes mi(t) for all of the enlarged reservoirs (Lines 12 and 13).
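The per-stream sizing and the memory cap combine into a small allocation routine. A minimal sketch (names are ours), assuming the proportional heuristic described alongside Equation {8}, whose exact image is missing in this extraction:

def allocate_reservoirs(k_counts, M, e=0.05):
    # Equation {6}: desired size r_i(t) = k_i(t) / (1 + k_i(t) * e^2).
    desired = [k / (1 + k * e * e) for k in k_counts]
    total = sum(desired)
    if total <= M:
        return desired
    # Proportional heuristic (Equation {8}): scale so the sizes sum to M.
    return [M * r / total for r in desired]

# Three streams with 500, 5,000 and 50,000 tuples seen so far and M = 600.
print(allocate_reservoirs([500, 5_000, 50_000], 600))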
- the purpose of this evaluation is to empirically examine the adaptivity of the exemplary multi-reservoir sampling algorithm (Algorithm 3, above) with regard to reservoir size and sample uniformity. Two sets of experiments were conducted. The objective of the first set of experiments was to observe how the reservoir sizes change as data arrive. The objective of the second set of experiments was to observe the uniformity of the reservoir samples as the reservoir sizes change.
- the readings were collected using TinyDB, running on TinyOS.
- the resulting data file includes a log of about 2.3 million readings collected from these sensors.
- the schema of records was (date: yyyy-mm-dd, time: hh:mm:ss.xxx, epoch:int, moteid: int, temperature: real, humidity: real, light:real, voltage:real).
- Temperature is in degrees Celsius.
- Humidity is temperature-corrected relative humidity, ranging from 0 to 100%.
- Light is in Lux.
- Voltage is expressed in volts, ranging from 2.0 to 3.0 volts.
- the uniformity confidence threshold ⁇ was set to 0.90. It is believed that this value is adequately large to constrain the frequency of adjusting the memory allocation.
- the value of M was varied from 1000 (tuples) to 5000 (tuples) and the memory adjustment threshold φ was varied from 0.1 to 0.5. Readings acquired for the whole first day of the experiment were used in the experiments. Data collection was done every 1 hour and, accordingly, results on the change in reservoir size and sample uniformity are reported every hour.
- FIG. 7 shows the changes in the sizes of the 55 reservoirs.
- FIG. 8 shows the changes for 5 selected reservoirs.
- the reservoir sizes started fluctuating. The fluctuations were smooth and small in the first stage (from the 2nd to the 4th hour), larger in the second stage (from the 4th to the 21st hour), and eventually diminished in the last stage (after the 21st hour).
- This pattern of changes is attributed to the characteristics of data sets used in the experiments. In the first stage, there was no tangible difference between the numbers of readings acquired by different motes. Therefore, reservoir sizes stayed almost constant.
- FIG. 9 shows a similar pattern except that the changes in reservoir sizes happened less frequently, and saturated earlier. The reason for these observations can be easily seen from Equations {6} and {11}. Results obtained for varying the other parameters (M and ζ) show similar patterns, and are omitted due to space constraints.
- the χ² statistic measures the relative difference between the observed number of tuples (o(v)) and the expected number of tuples (e(v)) that contain the value v. That is: χ² = Σ_{v∈D} (o(v) − e(v))² / e(v).
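A direct Python rendering of this metric (ours), where the expected count e(v) is the population frequency of v scaled to the sample size:

from collections import Counter

def chi_squared(sample, population):
    # For each value v, compare the observed count o(v) in the sample with
    # the count e(v) expected under a uniform random sample of this size.
    observed = Counter(sample)
    scale = len(sample) / len(population)
    return sum((observed[v] - n_v * scale) ** 2 / (n_v * scale)
               for v, n_v in Counter(population).items())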
- FIG. 10 shows the changes in size and the resulting sample uniformity for one selected reservoir. It shows that when the reservoir size increases, the sample uniformity degrades (i.e., decreases) and then starts recovering (i.e., increasing). The degree of uniformity degradation and recovery varies due to randomness in the data sets used in experiments.
- this disclosure also addresses the problem of reservoir sampling over memory-limited stream joins. Novel concepts directed to this problem and two algorithms for performing reservoir sampling on the join result are presented below. These algorithms are referred to herein as the “reservoir join-sampling” (RJS) algorithm and the “progressive reservoir join-sampling” (PRJS) algorithm.
- the reservoir size is fixed.
- the sample in the reservoir is always a uniform random sample of the join result. Therefore, RJS fits those applications that may use the sample in the reservoir at any time (e.g., continuous queries).
- This algorithm may not accommodate a memory-limited situation in which the available memory may be too small even for storing tuples in the join buffer. In such a situation, it may be infeasible to allocate the already limited memory to a reservoir with an adequately large size.
- the PRJS algorithm is designed to alleviate this problem by increasing the reservoir size during the sampling process.
- the conventional reservoir sampling technique of RJS is replaced with what is referred to herein as “progressive reservoir sampling.”
- progressive reservoir sampling is the case of adaptive reservoir sampling (see Algorithm 2, above) in which the sampling reservoir size is increased during sampling.
- a key idea of PRJS is to exploit the property of reservoir sampling that the sampling probability keeps decreasing for each subsequent tuple. Based on this property, the memory required by the join buffer keeps decreasing during the join-sampling. Therefore, PRJS releases the join buffer memory not needed anymore and allocates it to the reservoir.
- PRJS is designed so that it determines how much the reservoir can be increased given a sample-use time and a uniformity confidence threshold.
- the present inventors have performed extensive experiments to evaluate the RJS and PRJS algorithms with respect to the two competing factors (size and uniformity of sample). The inventors have also compared the two algorithms in terms of the aggregation error resulting from applying AVG on the join result. The experimental results confirm the expected tradeoffs.
- the RJS and PRJS algorithms, as well as a description of the experiments, are presented and described below.
- as described above relative to Equation {3} and FIGS. 3-5 in connection with adaptive reservoir sampling, when the size of the sample reservoir is increased (i.e., the reservoir size is "progressively" increased), the uniformity confidence UC (Equation {3}) drops below 100%, then increases monotonically and saturates as the uniformity confidence recovery tuple count m increases.
- progressive reservoir sampling is one case of adaptive reservoir sampling (Algorithm 2, above) wherein the size of the reservoir is only increased.
- a progressive reservoir sampling algorithm is as follows:
Algorithm 4: Progressive Reservoir Sampling
Inputs: r {reservoir size}
        k {number of tuples seen so far}
        ζ {uniformity confidence threshold}
1: while true do
2:   while reservoir size does not increase do
3:     conventional reservoir sampling (Algorithm 1, Background section, above).
4:   end while
5:   Find the minimum value of m (using Equation {3} with the current values of k, r, δ) that causes the UC to exceed ζ.
6:   flip a biased coin to decide on the number, x, of tuples to retain among r tuples already in the reservoir (Equation {4}).
7:   randomly evict r − x tuples from the reservoir.
8:   select r + δ − x tuples from the incoming m tuples using conventional reservoir sampling (Algorithm 1, Background section, above).
9: end while
- the progressive reservoir sampling works as shown in Algorithm 4. As long as the size of the reservoir does not increase, it uses the conventional reservoir sampling to sample the input stream (Line 3). Once the reservoir size increases by ⁇ , the algorithm computes the minimum value of m (using Equation ⁇ 3 ⁇ ) that causes the UC to exceed a given threshold ( ⁇ ) (Line 5). Then, the algorithm flips a biased coin to decide on the number of tuples (x) to retain among the r tuples already in the reservoir (Line 6). The probability of choosing the value x is defined in Equation ⁇ 4 ⁇ , above.
- the algorithm randomly evicts r ⁇ x tuples from the reservoir (Line 7) and refills the remaining reservoir space with r+ ⁇ x tuples from the arriving m tuples using the conventional reservoir sampling (Line 8).
- the algorithm continues sampling the input stream using the conventional reservoir sampling (Line 3) as if the sample in the enlarged reservoir were a uniform random sample of the k+m tuples.
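Lines 6-8 of Algorithm 4 amount to retaining x of the current tuples and refilling the free slots from the next m arrivals with conventional reservoir sampling. A self-contained sketch of that step (ours; x and the length of incoming_m are assumed to come from Equations {4} and {3}, reconstructed earlier):

import random

def grow_and_refill(reservoir, delta, x, incoming_m):
    # Line 7: retain x tuples chosen at random (i.e., evict r - x of them).
    r = len(reservoir)
    retained = random.sample(reservoir, x)
    # Line 8: select r + delta - x of the next m tuples with Algorithm 1.
    y = r + delta - x
    slots = []
    for j, item in enumerate(incoming_m, start=1):
        if j <= y:
            slots.append(item)
        elif random.random() < y / j:
            slots[random.randrange(y)] = item
    return retained + slots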
- FIG. 11 illustrates a processing model 1100 of join sampling, i.e., uniform random sampling over a (two-way) join output stream.
- A is the join attribute.
- Wi contains the tuples of stream Si currently within its sliding window.
- Every join-result tuple may be classified as either an S 1 -probe join tuple or an S 2 -probe join tuple.
- an S2-probe join tuple is a join tuple produced by a tuple s2 ∈ W2 (i.e., when a newly arriving s1 probes W2); an S1-probe join tuple is defined symmetrically.
- a tuple s1 ∈ S1 may first produce S2-probe join tuples when it arrives. Then, before it expires from W1, it may produce S1-probe join tuples with tuples newly arriving on S2.
- n1(s1) is a function which returns the number of S1-probe join tuples produced by a tuple s1 ∈ S1 before it expires from W1.
- n2(s2) is defined symmetrically.
- the available memory M is limited, and insufficient for the join buffer to hold all tuples of the current sliding windows. It is assumed the initial reservoir size, r, is given. Under this join-sampling processing model, the present inventors have observed that as time passes memory requirement on the join buffer can be lowered and memory from the join buffer can be transferred to the reservoir. This makes the results of progressive reservoir sampling applicable to this processing model.
- each of the new RJS and PRJS algorithms may be considered to have two phases: 1) a join sampling phase and 2) a reservoir sampling phase.
- the sampling probability used in the first phase is denoted p1 and the sampling probability used in the second phase is denoted p2.
- the join sampling phase utilizes a particular uniform join-sampling algorithm known as the “UNIFORM algorithm.”
- the UNIFORM algorithm (Algorithm 5) appears immediately below.
Algorithm 5: Uniform Join-Sampling (UNIFORM)
1: for each s2 in W2 where s2.A = s1.A do
2:   s2.num = s2.num + 1
3:   if s2.num = s2.next then
4:     output s1 || s2
5:     decide on the next s1 to join with s2
6:   end if
7: end for
8: pick X ~ G(p1) {geometric distribution}
9: s1.next = s1.num + X
10: if s1.next > n1(s1) then
11:   discard s1
12: end if
- Algorithm 5 outlines the steps of the algorithm for one-way join from S1 to S2. (The join in the opposite direction, from S2 to S1, is symmetric.)
- the algorithm works with two prediction models that provide n 1 (s 1 ): 1) a frequency-based model and 2) an age-based model.
- the frequency-based model assumes that, given a domain D of the join attribute A, for each value v ⁇ D a fixed fraction f 1 (v) of the tuples arriving on S 1 and a fixed fraction f 2 (v) of the tuples arriving on S 2 have value v of the attribute A.
- the age-based model assumes that for a tuple s1 ∈ S1 the S1-probe join tuples produced by s1 satisfy the conditions that 1) the number of S1-probe join tuples produced by s1 is a constant independent of s1 and 2) out of the n1(s1) S1-probe join tuples of s1, a certain number of tuples is produced when s1 is between the age g−1 and g.
- These definitions are symmetric for a tuple s 2 ⁇ S 2 .
- the choice of a predictive model is not important to the novelty of concepts disclosed herein; thus, without loss of generality, the frequency-based model is used in the rest of this disclosure.
- n1(s1) = λ2 · W1 · f2(s1.A)
- the join sampling probability p1 is computed by first obtaining the expected memory usage (i.e., the expected number of tuples retained in the join buffer) in terms of p1 and then equating this to the amount of memory available for performing the join and solving for p1.
- the expected memory usage of W1 is thus obtained.
- the algorithm proceeds as shown in Algorithm 5.
- the UNIFORM algorithm flips a coin with bias p 1 to decide the next S 1 -probe join tuple of s 1 (Lines 8-9).
- the UNIFORM algorithm picks X at random from the geometric distribution with parameter p 1 , G(p 1 ). If all remaining S 1 -probe join tuples of s 1 are rejected in the coin flips, s 1 is discarded (Lines 10-12).
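The geometric skip in Lines 8-9 is what makes UNIFORM cheap: one random draw tells how many candidate join tuples to pass over, instead of flipping a p1-coin for every candidate. A sketch of that draw (ours), by inverse-CDF sampling:

import math
import random

def geometric_skip(p1):
    # X ~ G(p1) on {1, 2, ...}: the offset of the next sampled probe-join
    # tuple. Equivalent to repeated coin flips with bias p1 (0 < p1 < 1),
    # but O(1) work per sampled tuple.
    u = random.random()
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p1)))

# Usage (Lines 8-12): after emitting a join tuple for s1,
#   s1.next = s1.num + geometric_skip(p1)
# and s1 is discarded once s1.next exceeds the predicted n1(s1).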
- PRJS needs to know the values of m (uniformity confidence recovery tuple count) and ⁇ (uniformity confidence threshold). Given the time left until the sample-use (or collection) time (denoted as T), the number of tuples (denoted as l) that would be generated during T if there were no join sampling is computed as follows:
- the first step (Lines 2-13) concerns the memory transfer mechanism of PRJS. Initially there is no memory that can be transferred, since the memory utilization of the join buffer is 100%. As long as this is the case, PRJS works in the same way as RJS does (see Algorithm 6) except that, for each new tuple s i arriving on join input stream S i , p 1 is decreased to r/(k+1) and, accordingly, PRJS re-computes memory utilization of the join buffer.
- the reason for assigning this particular value to p 1 is that all S i -probe join tuples to be produced by s i while s i ⁇ W i should be sampled with effectively a probability of no more than r/(k+1).
- PRJS keeps decreasing p 1 and re-computing the memory utilization until it finds that some memory can be released from the join buffer and transferred to the reservoir.
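A sketch of that bookkeeping loop (ours; the patent's expected join-buffer usage expression is not reproduced in this extraction, so it is abstracted as a caller-supplied function):

def transferable_memory(k, r, M, expected_buffer_usage):
    # As k grows, the effective sampling probability falls to r / (k + 1),
    # the join buffer's expected footprint shrinks, and the difference can
    # be handed to the reservoir. `expected_buffer_usage(p1)` is a
    # hypothetical stand-in for the patent's expected-memory expression.
    p1 = r / (k + 1)
    needed = expected_buffer_usage(p1)
    return p1, max(0, M - r - needed)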
- PRJS finds the largest amount of memory ( ⁇ ) that can be released from the join buffer and transferred to the reservoir, considering the following constraints:
- FIG. 12 shows the average sample size over time, at the interval of 10 time units, for both PRJS and RJS.
- for PRJS, the sample size increased linearly until the enlarged reservoir was filled, and then the increase saturated. The same happened for RJS, but the sample size never exceeded the initial reservoir size.
- FIG. 13 shows the sample size over the first 1000 time units for a single run. Note that the sample size decreased initially because some sample tuples were evicted from the reservoir after x and y were decided. The sample size recovered quickly after that.
- FIG. 14 shows the effect of PRJS on the reservoir size for varying l, which was used instead of m because the value of m is an expected value for a given l (see Equation ⁇ 16 ⁇ ).
- the figure shows that the increase of size was larger for larger values of l.
- the effect saturated for relatively large values of l.
- the purpose of this set of experiments was to test the uniformity of the sample in the reservoir.
- the chi-squared (χ²) statistic was used as a metric of the sample uniformity. Higher χ² indicates lower uniformity and vice versa.
- the χ² statistic measures, for each value v in a domain D, the relative difference between the observed number of tuples (o(v)) and the expected number of tuples (e(v)) that contain the value. That is: χ² = Σ_{v∈D} (o(v) − e(v))² / e(v).
- FIG. 15 shows the ⁇ 2 statistic over time for both algorithms, at the interval of 100 time units.
- the underlying assumption was that the input stream is randomly sorted on the join attribute value.
- the results in the figure show that for PRJS the uniformity was decreased after the reservoir size was increased, but it started recovering before the sample-use time. As expected, the sample uniformity for RJS was better and was almost stable over time.
- FIG. 16 shows that, for PRJS, there was more damage on the uniformity when the degree of the input stream ordering was higher.
- RJS is not sensitive to any kind of ordering in the input stream. This is evident for RJS and, thus, the graph is omitted.
- μ̂i is the aggregation result computed from a sample in the reservoir
- n is the number of runs.
- any one or more of Algorithms 2-7, above may be conveniently implemented using one or more machines (e.g., general-purpose computing devices, devices incorporating application-specific integrated circuits, devices incorporating systems-on-chip, etc.) programmed according to the teachings of the present specification, as will be apparent to those of ordinary skill in the computer arts.
- Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those of ordinary skill in the software art.
- Such software may be a computer program product that employs one or more machine-readable media and/or one or more machine-readable signals.
- a machine-readable medium may be any medium that is capable of storing and/or encoding a sequence of instructions for execution by a machine (e.g., a general purpose computing device) and that causes the machine to perform any one of the methodologies and/or embodiments described herein.
- Examples of a machine-readable medium in the form of a non-volatile machine-readable medium include, but are not limited to, a magnetic disk (e.g., a conventional floppy disk, a hard drive disk), an optical disk (e.g., a compact disk “CD”, such as a readable, writeable, and/or re-writable CD; a digital video disk “DVD”, such as a readable, writeable, and/or rewritable DVD), a magneto-optical disk, a read-only memory “ROM” device, a random access memory “RAM” device, a magnetic card, an optical card, a solid-state memory device (e.g., a flash memory), an EPROM, an EEPROM, and any combination thereof.
- a machine-readable medium is intended to include a single medium as well as a collection of physically separate media, such as, for example, a collection of compact disks or one or more hard disk drives in combination with a computer memory.
- non-volatile as used above and in the amended claims excludes encoded signals that propagate via electromagnetic energy, pressure energy, or other form of energy.
- Examples of a computing device include, but are not limited to, a computer workstation, a terminal computer, a server computer, a handheld device (e.g., tablet computer, a personal digital assistant “PDA”, a mobile telephone, etc.), a web appliance, a network router, a network switch, a network bridge, a computerized device, such as a wireless sensor or dedicated proxy device, any machine capable of executing a sequence of instructions that specify an action to be taken by that machine, and any combination thereof.
- FIG. 18 shows a diagrammatic representation of one embodiment of a computing device in the exemplary form of a computer system 1800 within which a set of instructions for causing the device to perform any one or more of the aspects and/or methodologies of the present disclosure may be executed.
- Computer system 1800 includes a processor 1804 (e.g., a microprocessor) (more than one may be provided) and a memory 1808 that communicate with each other, and with other components, via a bus 1812 .
- Bus 1812 may include any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combination thereof, using any of a variety of bus architectures well known in the art.
- Memory 1808 may include various components including, but not limited to, a random access read/write memory component (e.g., a static RAM (SRAM), a dynamic RAM (DRAM), etc.), a read only component, and any combination thereof.
- a basic input/output system 1816 (BIOS), including basic routines that help to transfer information between elements within computer system 1800 , such as during start-up, may be stored in memory 1808 .
- Memory 1808 may also include (e.g., stored on one or more machine-readable media) instructions (e.g., software) 1820 embodying any one or more of the aspects and/or methodologies of the present disclosure.
- memory 1808 may further include any number of instruction sets including, but not limited to, an operating system, one or more application programs, other program modules, program data, and any combination thereof.
- Computer system 1800 may also include one or more storage devices 1824 .
- storage devices suitable for use as any one of the storage devices 1824 include, but are not limited to, a hard disk drive device that reads from and/or writes to a hard disk, a magnetic disk drive device that reads from and/or writes to a removable magnetic disk, an optical disk drive device that reads from and/or writes to an optical media (e.g., a CD, a DVD, etc.), a solid-state memory device, and any combination thereof.
- Each storage device 1824 may be connected to bus 1812 by an appropriate interface (not shown).
- Example interfaces include, but are not limited to, Small Computer Systems Interface (SCSI), advanced technology attachment (ATA), serial ATA, universal serial bus (USB), IEEE 1394 (FIREWIRE), and any combination thereof.
- storage device 1824 may be removably interfaced with computer system 1800 (e.g., via an external port connector (not shown)).
- storage device 1824 and an associated machine-readable medium 1828 may provide nonvolatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data and/or data storage for computer system 1800 .
- software 1820 may reside, completely or partially, within machine-readable medium 1828 .
- software 1820 may reside, completely or partially, within processor 1804 .
- computer system 1800 may also include one or more input devices 1832 .
- a user of computer system 1800 may enter commands and/or other information into the computer system via one or more of the input devices 1832 .
- Examples of input devices that can be used as any one of input devices 1832 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device, a joystick, an audio input device (e.g., a microphone, a voice response system, etc.), a cursor control device (e.g., a mouse), a touchpad, an optical scanner, a video capture device (e.g., a still camera, a video camera), touchscreen, a digitizer pad, and any combination thereof.
- Each input device 1832 may be interfaced to bus 1812 via any of a variety of interfaces (not shown) including, but not limited to, a serial interface, a parallel interface, a game port, a Universal Serial Bus (USB) interface, a FIREWIRE interface, a direct interface to the bus, a wireless interface (e.g., a Bluetooth® connection) and any combination thereof.
- Commands and/or other information may be input to computer system 1800 via storage device 1824 (e.g., a removable disk drive, a flash drive, etc.) and/or one or more network interface devices 1836 .
- a network interface device such as network interface device 1836 , may be utilized for connecting computer system 1800 to one or more of a variety of networks, such as network 1840 , and one or more remote devices 1844 connected thereto. Examples of a network interface device include, but are not limited to, a network interface card, a modem, a wireless transceiver (e.g., a Bluetooth® transceiver) and any combination thereof.
- Examples of a network include, but are not limited to, a wide area network (e.g., the Internet, an enterprise network), a local area network (e.g., a network associated with an office, a building, a campus, a group of wireless sensors or other group of data streaming devices, or other relatively small geographic space), a telephone network, a direct connection between two computing devices, and any combination thereof.
- a network such as network 1840 , may employ a wired and/or a wireless mode of communication. In general, any network topology may be used.
- Information (e.g., data, software 1820, etc.) may be communicated to and/or from computer system 1800 via network interface device 1836.
- computer system 1800 may further include a video display adapter 1848 for communicating a displayable image to a display device, such as display device 1852 .
- Examples of a display device include, but are not limited to, a liquid crystal display (LCD), a cathode ray tube (CRT), a plasma display, and any combination thereof.
- a computer system 1800 may include one or more other peripheral output devices including, but not limited to, an audio speaker, a printer, and any combination thereof.
- peripheral output devices may be connected to bus 1812 via a peripheral interface 1856 .
- Examples of a peripheral interface include, but are not limited to, a serial port, a USB connection, a FIREWIRE connection, a parallel connection, and any combination thereof.
- a digitizer (not shown) and an accompanying pen/stylus, if needed, may be included in order to digitally capture freehand input.
- a pen digitizer may be separately configured or coextensive with a display area of display device 1852 . Accordingly, a digitizer may be integrated with display device 1852 , or may exist as a separate device overlaying or otherwise appended to the display device.
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Complex Calculations (AREA)
Abstract
Description
Algorithm 1: Conventional Reservoir Sampling
Inputs: r {reservoir size}
1: k = 0
2: for each tuple arriving from the input stream do
3:   k = k + 1
4:   if k ≦ r then
5:     add the tuple to the reservoir
6:   else
7:     sample the tuple with the probability r/k and replace a randomly selected tuple in the reservoir with the sampled tuple
8:   end if
9: end for
Algorithm 2: Adaptive Reservoir Sampling
Inputs: r {reservoir size}
        k {number of tuples seen so far}
        ζ {uniformity confidence threshold}
1: while true do
2:   while reservoir size does not change do
3:     conventional reservoir sampling (Algorithm 1, Background section, above).
4:   end while
5:   if reservoir size is decreased by δ then
6:     randomly evict δ tuples from the reservoir.
7:   else
8:     {i.e., reservoir size is increased by δ}
9:     Find the minimum value of m (using Equation {3}, below, with the current values of k, r, δ) that causes the UC to exceed ζ.
10:    flip a biased coin to decide on the number, x, of tuples to retain among r tuples already in the reservoir (Equation {4}, below).
11:    randomly evict r − x tuples from the reservoir.
12:    select r+δ−x tuples from the incoming m tuples using conventional reservoir sampling (Algorithm 1, Background section, above).
13:  end if
14: end while
Suppose the size of the reservoir is decreased from r to r−δ immediately after the k-th tuple arrives. Writing C(n, j) for the binomial coefficient, there are C(r, r−δ) different S[r−δ]'s that can be selected in the reduced reservoir from the original reservoir. Note that there are C(k, r) different S[r]'s that can be selected in the original reservoir from the k tuples and there are C(k−(r−δ), δ) duplicate S[r−δ]'s that can be selected in the reduced reservoir from the different S[r]'s. Therefore, there are

C(k, r) · C(r, r−δ) / C(k−(r−δ), δ) = C(k, r−δ)

different S[r−δ]'s that can be selected in the reduced reservoir from the k tuples. On the other hand, the number of different samples of size r−δ that should be statistically possible from sampling k tuples is C(k, r−δ). Hence, the uniformity confidence is expressed as follows:

UC = C(k, r−δ) / C(k, r−δ) = 100%,

which clearly shows that the uniformity confidence is 100%.
Suppose instead that the size is increased from r to r+δ and that x of the r tuples in the reservoir are retained while (r+δ)−x tuples are selected from the m incoming tuples. Counting over the randomness of the original reservoir, there are

C(k, x) · C(m, (r+δ)−x)

different S[r+δ]'s for each x in the range [max{0, (r+δ)−m}, r]. On the other hand, the number of different samples of size r+δ that should be statistically possible from sampling k+m tuples is C(k+m, r+δ). Hence, with the eviction in place, the uniformity confidence is expressed as follows:

UC = ( Σ_{x=max{0,(r+δ)−m}}^{r} C(k, x) · C(m, (r+δ)−x) ) / C(k+m, r+δ)   {3}

where m ≧ δ.
TABLE 1
Symbol  | Description
ν       | number of streams (i.e., number of reservoirs)
Si      | stream i
Ri      | the reservoir allocated to Si
M       | total available memory for ν reservoirs
t       | current time point
ri(t)   | computed size of Ri at t
riM(t)  | computed size of Ri at t with limited memory M
ri(tu)  | size of Ri adjusted at time point tu (tu < t)
δi(t)   | change in the size of Ri at t
ki(t)   | number of tuples seen up to t from Si
mi(t)   | number of tuples to be seen from Si, starting from t, to fill an enlarged reservoir Ri
λi(t)   | the average stream rate of Si
T       | time period left until the next data collection time
ζ       | uniformity confidence threshold
φ       | memory adjustment threshold (0 ≦ φ ≦ 1)
where n is the sample size, N is the population size, and e is 1−confidence interval.
subject to the following limit on the total memory M: the ν reservoir sizes may sum to at most M.
It is assumed that M may not be large enough for all reservoirs. In this case, we use the heuristic of allocating the memory to each reservoir Ri in proportion to the value of ri(t) computed using Equation {6}. That is:

riM(t) = M × ri(t) / ( Σ_{j=1}^{ν} rj(t) )   {8}
m i(t)=λi(t)×T {9}
For an enlarged reservoir Ri, the uniformity confidence expressed in Equation {3} is refined here by substituting ki(t), ri(tu), δi(t), and mi(t) for k, r, δ, and m, respectively:

UCi(ki(t), ri(tu), δi(t), mi(t))   {10}

where mi(t) > δi(t).
The memory allocation is adjusted only if the relative change in the computed size exceeds the threshold for some Ri:

|δi(t)| / ri(tu) ≧ φ   {11}

where 0 ≦ φ ≦ 1.
Algorithm 3: Adaptive Multi-Reservoir Sampling

Inputs: ζ, φ, M, T, {r1(tu), r2(tu), ..., rν(tu)}, {k1(t), k2(t), ..., kν(t)}, {λ1(t), λ2(t), ..., λν(t)}

1: while true do
2:   while there are no tuples arriving from any stream do
3:     {do nothing}
4:   end while
     {one or more tuples arrived from some streams}
5:   compute ri(t) (Equation {6}) for the streams from which tuples arrived
6:   for each Ri ∈ {R1, R2, ..., Rν} do
7:     compute riM(t) (Equation {8})
8:     compute δi(t) = riM(t) − ri(tu)
9:   end for
10:  if Equation {11} holds for any Ri ∈ {R1, R2, ..., Rν} then
11:    Lreduced = set of all Ri whose δi(t) < 0
12:    Lenlarged = set of all Ri whose δi(t) > 0
13:    compute mi(t) (Equation {9}) for all Ri ∈ Lenlarged
14:    L′enlarged = set of all Ri ∈ Lenlarged whose UCi(ki(t), ri(tu), δi(t), mi(t)) ≦ ζ
15:    if L′enlarged is empty then
16:      for each Ri ∈ (Lreduced ∪ Lenlarged) do
17:        if Ri ∈ Lreduced then
18:          randomly evict |δi(t)| tuples from Ri
19:        else
20:          flip a biased coin to decide on the number of tuples, x, to retain in Ri (using Equation {4} with ki(t), ri(tu), δi(t), mi(t) substituting k, r, δ, m, respectively)
21:          randomly evict ri(tu) − x tuples from Ri
22:          select ri(tu) + δi(t) − x tuples from the incoming mi(t) tuples using Algorithm 1 (Background section, above)
23:        end if
24:        ri(tu) = riM(t)
25:      end for
26:    end if
27:  end if
28: end while
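The biased coin on line 20 can be implemented by drawing x with probability proportional to the number of enlarged samples reachable for that x, i.e., the summand of Equation {3}; the precise form of Equation {4} is assumed here, so this Python sketch is illustrative rather than definitive:

```python
import random
from math import comb

def draw_retained_count(k: int, r: int, delta: int, m: int) -> int:
    """Decide x, the number of tuples to retain in the reservoir,
    weighting each feasible x by the number of size-(r+delta) samples
    reachable through it (the summand of Equation {3})."""
    target = r + delta
    xs = list(range(max(0, target - m), r + 1))
    weights = [comb(k, x) * comb(m, target - x) for x in xs]
    return random.choices(xs, weights=weights, k=1)[0]

x = draw_retained_count(k=1000, r=100, delta=10, m=500)
# Lines 21-22: evict r - x tuples at random, then fill the remaining
# (r + delta) - x slots from the next m arrivals with Algorithm 1.
```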
In our experiments, we measure χ² statistics for the humidity attribute. For this, we round the original real-valued humidity reading to the nearest integer.
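A minimal sketch of such a χ² measurement, assuming the statistic compares the rounded-humidity histogram of the sample against that of the full stream (the exact bucketing used in the experiments is not spelled out in the text):

```python
from collections import Counter

def chi_square(sample: list[float], stream: list[float]) -> float:
    """Chi-square statistic over integer humidity buckets: observed
    sample counts versus counts expected from the stream's histogram
    scaled down to the sample size."""
    obs = Counter(round(h) for h in sample)
    pop = Counter(round(h) for h in stream)
    scale = len(sample) / len(stream)
    return sum(
        (obs.get(v, 0) - cnt * scale) ** 2 / (cnt * scale)
        for v, cnt in pop.items()
    )
```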
Algorithm 4: Progressive Reservoir Sampling

Inputs: r {reservoir size}
        k {number of tuples seen so far}
        ζ {uniformity confidence threshold}

1: while true do
2:   while the reservoir size does not increase do
3:     conventional reservoir sampling (Algorithm 1, Background section, above)
4:   end while
5:   find the minimum value of m (using Equation {3} with the current values of k, r, δ) that causes the UC to exceed ζ
6:   flip a biased coin to decide on the number, x, of tuples to retain among the r tuples already in the reservoir (Equation {4})
7:   randomly evict r − x tuples from the reservoir
8:   select r + δ − x tuples from the incoming m tuples using conventional reservoir sampling (Algorithm 1, Background section, above)
9: end while
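A Python sketch of the two pieces Algorithm 4 alternates between, reusing uniformity_confidence from the earlier sketch (the biased coin of line 6 is the draw_retained_count sketch above); the linear scan for m is for clarity, not efficiency:

```python
import random

def reservoir_step(reservoir: list, tup, k: int, r: int) -> None:
    """One step of conventional reservoir sampling (Algorithm 1):
    the k-th tuple enters with probability r/k and replaces a
    randomly chosen victim."""
    if len(reservoir) < r:
        reservoir.append(tup)
    elif random.random() < r / k:
        reservoir[random.randrange(r)] = tup

def min_recovery_count(k: int, r: int, delta: int, zeta: float) -> int:
    """Line 5: smallest m that pushes Equation {3}'s uniformity
    confidence past zeta (uniformity_confidence is the function from
    the earlier sketch)."""
    m = max(1, delta)
    while uniformity_confidence(k, r, delta, m) < zeta:
        m += 1
    return m
```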
Algorithm 5: Uniform Join-Sampling (UNIFORM)

1: for each s2 in W2 where s2.A = s1.A do
2:   s2.num = s2.num + 1
3:   if s2.num = s2.next then
4:     output s1 || s2
5:     decide on the next s1 to join with s2
6:   end if
7: end for
8: pick X ~ G(p1) {geometric distribution}
9: s1.next = s1.num + X
10: if s1.next > n1(s1) then
11:   discard s1
12: end if
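The geometric draw on line 8 can be done by inversion, which is what lets UNIFORM skip directly to the next sampled join tuple instead of flipping a coin per tuple; a minimal sketch:

```python
import random
from math import log

def geometric_skip(p1: float) -> int:
    """Draw X ~ G(p1): the gap, in join tuples, until the next one
    output. Skipping ahead by X gives the same distribution as keeping
    each join tuple independently with probability p1."""
    u = 1.0 - random.random()            # u in (0, 1]
    return int(log(u) / log(1.0 - p1)) + 1

# On arrival of s1 (line 9): s1.next = s1.num + geometric_skip(p1)
```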
A symmetric expression holds for the expected memory usage of W2, assuming the same sampling probability p1 for the S2-probe join tuples.
TABLE 2

| Symbol | Description |
|---|---|
| Si | Data stream i (i = 1, 2) |
| λi | Rate of stream Si |
| si | Tuple arriving in stream Si |
| Wi | Sliding window on stream Si |
| A | Join attribute (common to S1 and S2) |
| Si-probe | Join tuple produced by si ∈ Wi |
| ni(si) | Number of Si-probe join tuples produced by a tuple si ∈ Si before it expires from Wi |
| S | Sample in a reservoir |
| r | Initial reservoir size |
| δ | Increment of a reservoir size |
| k | Number of tuples seen so far in an input stream |
| l | Number of tuples that would be generated without join-sampling by the time the reservoir sample will be used (or collected) |
| RC | Reservoir refill confidence |
| ξ | Reservoir refill confidence threshold |
| UC | Uniformity confidence in a reservoir sample |
| ζ | Uniformity confidence threshold |
| m | Uniformity confidence recovery tuple count, i.e., number of tuples to be seen in an input stream of the progressive reservoir sampling until UC for the enlarged reservoir reaches ζ |
| x | Number of tuples to be selected from k after increasing the reservoir size |
| y | Number of tuples to be selected from m after increasing the reservoir size |
| p1 | Join sampling probability in the first phase of the algorithms RJS and PRJS |
| p2 | Reservoir sampling probability in the second phase of the algorithms RJS and PRJS |
Join Sampling—Reservoir Join Sampling
Algorithm 6: Reservoir Join-Sampling (RJS)

1: k = 0
2: for each tuple output by UNIFORM do
3:   if k ≦ r then
4:     add the tuple to the reservoir
5:   else
6:     sample the tuple with the probability p2 = (r/(k + 1))/p1
7:   end if
8:   k = k + (1/p1)
9: end for
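A compact Python sketch of RJS: each tuple from UNIFORM stands in for 1/p1 join-result tuples, so the running count k advances by 1/p1 and the usual reservoir probability r/(k+1) is divided by p1.

```python
import random

def rjs(uniform_output, r: int, p1: float) -> list:
    """Reservoir Join-Sampling (Algorithm 6) over the tuples that the
    UNIFORM join-sampling phase emits."""
    reservoir, k = [], 0.0
    for tup in uniform_output:
        if len(reservoir) < r:
            reservoir.append(tup)
        elif random.random() < (r / (k + 1)) / p1:
            reservoir[random.randrange(r)] = tup
        k += 1.0 / p1
    return reservoir
```

PRJS's first phase (lines 1-13 of Algorithm 7) behaves the same way, except that p1 itself is updated to r/(k + 1) after each tuple.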
Algorithm 7: Progressive Reservoir Join-Sampling (PRJS)

1: k = 0
   {initially, the memory utilization of the join buffer is 100%}
2: while the memory utilization of the join buffer does not decrease do
3:   for each tuple output by the UNIFORM algorithm do
4:     if k ≦ r then
5:       add the tuple to the reservoir
6:     else
7:       sample the tuple with a probability p2 = (r/(k + 1))/p1
8:     end if
9:     k = k + (1/p1)
10:    set p1 = r/(k + 1) {for the next incoming tuple}
11:    re-compute the memory utilization of the join buffer using Equations {13} and {14}
12:  end for
13: end while
14: while (RC(m) ≧ ξ) and (UC(S[r+δ]) ≧ ζ) and (m ≧ (x + y) − (p1(k + 1))) do
15:   decrease p1 by a specified constant value
16:   re-compute the memory utilization of the join buffer using Equations {13} and {14}
17:   increase δ by the amount of unused memory
18: end while
19: while (RC(m) < ξ) or (UC(S[r+δ]) < ζ) or (m < (x + y) − (p1(k + 1))) do
20:   δ = δ − 1
21:   if δ = 0 then
22:     return
23:   end if
24: end while
25: release δ memory units from the join buffer and allocate the released memory to the reservoir
26: flip a biased coin to decide on x and y (Equation {4})
27: randomly evict r − x sample tuples from the reservoir
28: get y sample tuples out of m using Algorithm 1 (Background section, above)
29: continue sampling the input stream using Algorithm 1 (Background section, above)
As mentioned above, PRJS proceeds in two phases: 1) a join-sampling phase and 2) a reservoir-sampling phase. Tuples in the join-sampling phase are sampled with a probability p1. Therefore, the expected number of tuples to be seen by the reservoir-sampling phase (m) is:

m = l·p1 {16}

Given m and ζ, PRJS first verifies the following three constraints:
- Refill confidence: RC(m) ≧ ξ. The refill confidence, RC, is defined as the probability that m is at least the enlarged reservoir size; that is, given r and δ:

  RC(m) = probability(m ≧ r + δ) {17}

  Unlike progressive reservoir sampling (see Algorithm 4), PRJS cannot guarantee that the enlarged reservoir will be filled out of m tuples, since m is only the expected number of tuples output by the join-sampling phase (see Equation {16}). That is, the value of m is an expected value rather than an exact value. This means that the actual value of m may be less than r + δ, which implies that δ ≦ y ≦ min(m, r + δ), where y is the number of tuples to be selected from the m tuples. Therefore, the algorithm has to make sure that y falls in that range with a confidence no less than a given threshold ξ.
- Uniformity confidence: UC ≧ ζ (see Equation {3}). That is, the uniformity confidence should be no less than ζ after the enlarged reservoir is filled.
- Uniformity-recovery tuple count: m ≧ (x + y) − p1(k + 1). The rationale for this constraint is as follows. PRJS assumes the reservoir sample (of x + y tuples) will be used (or collected) after m more tuples have been seen. If the sample-use does not happen, PRJS has to continue conventional reservoir sampling on the join-sample tuples as if the sample in the reservoir were a uniform random sample of all join-result tuples seen so far. In this case, (x + y)/((k + (m/p1)) + 1) ≦ p1; hence, m ≧ (x + y) − p1(k + 1).
If all three constraints are satisfied, then in the second step PRJS keeps decreasing p1 and increasing δ until one or more of them are no longer satisfied. The more p1 is decreased, the larger δ can be. Therefore, PRJS finds the smallest possible p1 that satisfies the three constraints, which yields the largest possible amount of memory (δ) that can be transferred to the reservoir.

When PRJS enters the third step, δ has been set too large to satisfy one or more of the three constraints, so PRJS decreases δ until the constraints are satisfied or δ becomes 0. The latter case means that the reservoir size cannot be increased. Once δ (> 0) is determined, in the fourth step (Lines 25-29) PRJS releases δ memory units from the join buffer and allocates the released memory to the reservoir. Then, PRJS works in the same way as progressive reservoir sampling (see Lines 6-8 of Algorithm 4) to refill the reservoir.
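A sketch of the three-constraint test used in the second and third steps. Equation {17} leaves the distribution of the actual join-sample count implicit, so this version assumes each of the l future join tuples is kept independently, making the count Binomial(l, p1); it also reuses uniformity_confidence from the earlier sketch.

```python
from math import exp, lgamma, log

def refill_confidence(l: int, p1: float, r: int, delta: int) -> float:
    """RC of Equation {17} under an assumed Binomial(l, p1) model:
    probability that at least r + delta join-sample tuples arrive.
    Computed in log space to stay stable for large l."""
    def log_pmf(i: int) -> float:
        return (lgamma(l + 1) - lgamma(i + 1) - lgamma(l - i + 1)
                + i * log(p1) + (l - i) * log(1.0 - p1))
    return sum(exp(log_pmf(i)) for i in range(r + delta, l + 1))

def constraints_hold(l, p1, k, r, delta, x, y, xi, zeta) -> bool:
    """The three PRJS tests: refill confidence, uniformity confidence,
    and the uniformity-recovery tuple count."""
    m = l * p1                                    # Equation {16}
    return (refill_confidence(l, p1, r, delta) >= xi
            and uniformity_confidence(int(k), r, delta, int(m)) >= zeta
            and m >= (x + y) - p1 * (k + 1))
```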
Join Sampling—Experimental Examples
- Size of reservoir sample: Regardless of the initial reservoir size, PRJS eventually results in a larger reservoir than the fixed-size reservoir of RJS.
- Uniformity of reservoir sample: RJS's sample uniformity is always at least as high as PRJS's. For PRJS, the uniformity degrades when the reservoir size is increased but starts recovering promptly and approaches 100% as additional join-sample tuples are generated.
- Aggregation on a reservoir sample: For all the experimental settings used, the aggregation errors on the reservoir sample show that the benefit of gaining reservoir size outweighs the cost of losing sample uniformity. PRJS achieves smaller aggregation errors than RJS unless the initial reservoir size is too large for PRJS to have room to increase it.
The aggregation error is measured as the average relative error

$$\text{error} = \frac{1}{n}\sum_{i=1}^{n} \frac{|A_i - \hat{A}_i|}{A_i}$$

where Ai (i = 1, 2, ..., n) is the exact aggregation result computed from the original join result, Âi is the aggregation result computed from a sample in the reservoir, and n is the number of runs.
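A one-line helper for this measure, assuming the average-relative-error form reconstructed above:

```python
def avg_relative_error(exact: list[float], approx: list[float]) -> float:
    # Mean of |A_i - Â_i| / A_i over the n runs
    return sum(abs(a - b) / a for a, b in zip(exact, approx)) / len(exact)
```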