CN110704515B - Two-stage online sampling method based on MapReduce model


Info

Publication number
CN110704515B
Authority
CN
China
Prior art keywords
data
data block
result
value
stage
Prior art date
Legal status
Active
Application number
CN201911267526.5A
Other languages
Chinese (zh)
Other versions
CN110704515A (en)
Inventor
谭皓予
Current Assignee
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd
Priority to CN201911267526.5A
Publication of CN110704515A
Application granted
Publication of CN110704515B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a two-stage online sampling method based on the MapReduce model, comprising the following steps. Step 1: first-stage sampling: a whole-group (cluster) sampler is set before online processing at the map end of the MapReduce model, and sampling is performed with the data block as the sample unit. Step 2: in the query stage of the MapReduce model, an estimate of the query result is obtained and the confidence-interval width is calculated for a given confidence level. Step 3: second-stage sampling: before the reduce end starts processing, the probability of each data block being sampled at the reduce end is corrected by an acceptance-rejection sampler. Step 4: the discarded map-end output results are aggregated in a recycle bin at the reduce end and added to the snapshot result obtained from the accepted data blocks to obtain the actual result of the aggregate query. The invention guarantees the randomness of the sample without increasing network transmission cost, provides effective statistical estimates, eliminates the bias that data skew introduces into the statistical estimates, and thereby guarantees the unbiasedness and validity of the query estimates.

Description

Two-stage online sampling method based on MapReduce model
Technical Field
The invention relates to a method for online sampling of data, in particular to a two-stage online sampling method based on a MapReduce model.
Background
With the development of information digitization, the global data volume has grown explosively, and data mining and data analysis based on big data have become hot topics across many fields. On-Line Aggregation (OLA) technology quickly returns an approximate result computed from sample data, meeting the demands of real-time processing and fast user interaction. Compared with offline batch processing, online aggregation can return an estimated result together with a confidence interval at a given confidence level in a much shorter time, and it keeps returning approximate results during processing, with the estimation quality improving continuously as more data are processed. This online mode of returning results quickly, without waiting for all data to be processed, lets the user follow the progress of the query in time; when the user is satisfied with the accuracy of the estimated result, the query can be terminated, saving processing time and computational resources.
Online aggregation originated in the 1990s in the field of relational databases; it was initially aimed at aggregation operations over a single table and was later extended to aggregate queries over multi-table joins. Research on online aggregation in the MapReduce environment (a distributed computing model for large-scale data processing) has made some preliminary progress: for example, Condie et al. proposed the Hadoop Online Prototype (HOP), which provides an implementation platform for online aggregation in a MapReduce environment, and Pansare et al. proposed an online aggregation method based on the inspection paradox and Bayesian theory.
The MapReduce model is used for processing and generating large-scale data sets and adopts a functional programming idea. It contains two stages, map and reduce, so a user writing a MapReduce program generally implements two functions: a map function and a reduce function. The map function accepts a key-value pair as input and produces a set of intermediate key-value pairs as output. The MapReduce framework groups the intermediate key-value pairs produced by the map function by key and passes all values sharing the same key to the same reduce function.
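As a concrete illustration of this programming model (a minimal, self-contained sketch; the predicate p, the expression exp, and the in-memory shuffle are stand-ins, not the patent's code), the map and reduce functions for a filtered SUM query could look as follows:

```python
# Minimal sketch of the MapReduce programming idea for SELECT SUM(exp(t)) FROM T WHERE p(t).
# All names (predicate p, expression exp, the in-memory "framework") are illustrative only.
from collections import defaultdict

def map_fn(row):
    """Emit an intermediate key-value pair for each tuple that passes the predicate."""
    if p(row):                         # WHERE-clause predicate
        yield ("sum", exp(row))        # the key groups values for the reducer

def reduce_fn(key, values):
    """Aggregate all values that share the same key."""
    return key, sum(values)

def p(row):   return row["amount"] > 0     # example predicate
def exp(row): return row["amount"]         # example expression over attributes

def run(table):
    """Tiny in-memory stand-in for the framework: shuffle by key, then reduce."""
    groups = defaultdict(list)
    for row in table:
        for k, v in map_fn(row):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run([{"amount": 3}, {"amount": -1}, {"amount": 5}]))   # {'sum': 8}
```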
In existing online aggregation research, estimation algorithms based on the central limit theorem are widely used; while estimating the aggregation result they also provide a confidence level and a confidence-interval width as accuracy measures. After a query statement is given, such a method performs statistical modeling according to the type of aggregation operation and constructs a random variable X whose expectation is the final aggregation result, thereby converting estimation of the result into estimation of a population mean. Using the central limit theorem, the population mean and variance are estimated from random samples, completing the estimation of the aggregation result.
For an aggregate query statement SELECT aggr(exp(t)) FROM T, where aggr denotes the type of aggregation operation and exp is an algebraic expression over the attributes of table T, a statistical variable is constructed as
X_i = exp(t_i), if tuple t_i satisfies the filtering predicate p
X_i = 0, otherwise
where p is the filtering predicate; the population mean μ of X then determines the aggregate query result, and the sample mean is scaled by the ratio of the total data volume to the sample size to obtain the aggregation result to be computed. Given a confidence level (1 − α), a confidence interval for the true result can be computed for each estimate from the central limit theorem and expressed as
[μ̂ − ε_n, μ̂ + ε_n], with half-width ε_n = z_{α/2} · sqrt(s_n² / n)
which gives both an estimate μ̂ of the query result of the aggregate query statement under the MapReduce model and the width of its confidence interval. Here α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, the sample variance s_n² is used in place of the population variance σ², and n denotes the sample size. The user can judge the accuracy of the estimate from the confidence level and the confidence-interval width; the interval narrows gradually as the query proceeds, and the user can decide from its width whether to terminate the query early.
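A minimal sketch of this CLT-based estimation, assuming a SUM-type aggregate scaled up from a uniform random sample of tuples (the data and the 95% confidence level are illustrative choices, not part of the patent):

```python
# Sketch of a CLT-based running estimate with a confidence interval for a SUM-type
# aggregate, scaled up from a uniform random sample (illustrative only).
import math, random

def estimate_with_ci(sample_x, N):
    """sample_x: values X_i = exp(t_i) if the predicate holds, else 0, for n sampled tuples.
    Returns (estimate of the SUM over all N tuples, half-width of the 95% CI)."""
    n = len(sample_x)
    mean = sum(sample_x) / n
    var = sum((x - mean) ** 2 for x in sample_x) / (n - 1)   # sample variance s_n^2
    z = 1.959963985                                          # z_{alpha/2} for a 95% level
    eps = z * math.sqrt(var / n)                             # half-width for the mean
    return N * mean, N * eps                                 # scale mean and width to the SUM

random.seed(0)
population = [random.random() for _ in range(100_000)]
sample = random.sample(population, 2_000)
est, eps = estimate_with_ci(sample, len(population))
print(f"estimate = {est:.1f} +/- {eps:.1f}, true sum = {sum(population):.1f}")
```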
The online aggregation method based on the central limit theorem places high demands on the sampling algorithm: sample quality directly affects the accuracy of the estimate and the convergence speed of the confidence interval. An online aggregation system generally requires the sample data to be random, and online aggregation methods in the traditional relational-database field sample at tuple granularity. In related work on online aggregation, there are three main ways to sample from a relational data table: sequential scanning, index scanning, and index sampling. Sai Wu et al., studying online aggregation in a distributed environment, propose a method for sampling from a distributed data table: first compute the amount of data to be sampled on each node according to the distribution of the table across the nodes, then perform index sampling on each node in that proportion.
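The proportional-allocation step described above can be sketched as follows (node names and data volumes are made up for illustration):

```python
# Proportional allocation of a sampling budget across nodes, as in the distributed
# index-sampling approach described above (node sizes are illustrative).
def allocate(node_sizes, total_sample):
    """Split total_sample across nodes in proportion to the data volume on each node."""
    total = sum(node_sizes.values())
    alloc = {node: (size * total_sample) // total for node, size in node_sizes.items()}
    # Hand any rounding remainder to the largest node so the budget is met exactly.
    remainder = total_sample - sum(alloc.values())
    alloc[max(node_sizes, key=node_sizes.get)] += remainder
    return alloc

print(allocate({"node1": 4_000_000, "node2": 1_000_000, "node3": 5_000_000}, 1_000))
# {'node1': 400, 'node2': 100, 'node3': 500}
```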
Since data in a MapReduce environment is stored and processed in units of blocks, block-level random sampling is generally used in such environments for the sake of sampling efficiency. However, block-granularity random sampling in a distributed environment such as MapReduce cannot guarantee that the samples are independent and identically distributed. Although the input data blocks of each mapper are almost equal in size, data skew can still occur at the map end: for example, the proportion of valid data satisfying the selection predicate differs from block to block, so the intermediate result sets produced by the mappers differ in size and the mappers take different amounts of time. Moreover, there is some correlation between a mapper's processing time and the magnitude of the block's aggregate value: blocks containing less valid data tend to finish the map stage sooner and arrive at the reduce end earlier, and such blocks are more likely to contribute a small aggregate value. Observing the sample set at any moment during query processing, blocks with smaller aggregate values therefore appear with higher probability; the samples cannot be regarded as independent and identically distributed random variables, which undermines the unbiasedness of the estimation algorithm and the accuracy of the estimated result.
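This arrival-order bias can be illustrated with a small, entirely synthetic simulation in which processing time grows with a block's aggregate value, so the earliest snapshot is dominated by small blocks:

```python
# Synthetic illustration of the arrival-order bias: blocks with fewer valid tuples
# finish earlier, so an early snapshot over-represents blocks with small aggregates.
import random, statistics

random.seed(1)
# Each block's aggregate value; skewed so a few blocks hold most of the mass.
block_sums = [random.paretovariate(1.5) * 10 for _ in range(1_000)]
true_mean = statistics.mean(block_sums)

# Assume processing time grows with the block's aggregate: small blocks arrive first.
arrival_order = sorted(block_sums)            # earliest arrivals = smallest aggregates
early_snapshot = arrival_order[:100]          # first 10% of blocks to reach the reducer
print(f"true mean per block      : {true_mean:.1f}")
print(f"mean over early arrivals : {statistics.mean(early_snapshot):.1f}  (biased low)")
```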
Disclosure of Invention
The invention provides a two-stage online sampling method based on the MapReduce model, which eliminates the bias that data skew introduces into statistical estimation in a distributed processing environment, thereby guaranteeing the unbiasedness and validity of query estimation.
The invention relates to a two-stage online sampling method based on a MapReduce model, which comprises the following steps:
A. First-stage sampling: when the MapReduce model receives and initializes input data from an upstream data node, a whole-group (cluster) sampler is set before online processing at the map end; each data block forms one group, and sampling is performed with the data block as the sample unit;
B. In the online query stage of the MapReduce model, a variable is constructed with the data block as the observation unit:
Y_i = N · exp_p(B_i)
where N is the total number of data blocks in the data table and exp_p(B_i) is the aggregate over all tuples in data block B_i that satisfy the predicate condition. The estimate of the query result is
μ̂ = (1/n) · Σ_{i=1..n} Y_i
where n is the number of sampled blocks. According to the central limit theorem, the variables Y_i approximately follow a normal distribution with population mean μ and variance σ². Given a confidence level (1 − α), the confidence-interval half-width is calculated as
ε_n = z_{α/2} · sqrt(s_n² / n)
and the confidence interval of the aggregation result is
[μ̂ − ε_n, μ̂ + ε_n]
where α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, and s_n² is the sample variance;
C. Second-stage sampling: before the reduce end starts processing, an acceptance-rejection sampler corrects the probability that each data block is drawn by the reduce end. Whenever the map end produces a new output result, the acceptance-rejection sampler (A/R sampling) generates a random number u uniformly distributed between 0 and 1; if u is less than or equal to the acceptance probability, the map-end output result is accepted and allowed to enter subsequent aggregation processing, otherwise it is discarded. The map-end output results accepted by the acceptance-rejection sampler form a random, uniform block-level sample, where the acceptance probability of data block B_i is
α_i = (b_i / b_max)^β
b_i denotes the number of valid tuples contained in data block B_i, b_max denotes the maximum number of valid tuples contained in any data block, and β is called the adjustment factor, with
0 ≤ β ≤ 1;
D. A recycle bin is set at the reduce end; the map-end output results discarded by the acceptance-rejection sampler enter the recycle bin for aggregation. After all map-end output results in the recycle bin have been aggregated, the aggregation result of the recycle bin is added to the snapshot result (Snapshot) obtained from the aggregation of the accepted map-end output results in step C, yielding the actual result of the aggregate query (a compact sketch of this bookkeeping follows below). The core idea of the aggregation processing is that, as the proportion of sampled data grows, the aggregate value is estimated after each sample, and each estimate is a snapshot of the final aggregation result. The larger the proportion of sampled data relative to the total data, the more accurate the estimated aggregation result and the more the estimation interval converges. While the sampling progress has not yet reached 100%, the aggregation processing produces estimates of the aggregation result; and because these estimates can only be based on random samples, a part of the map-end output results must be discarded in step C. When the sampling progress reaches 100%, all data have been aggregated and the final exact aggregation result can be computed.
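A minimal sketch of the reduce-side bookkeeping described in step D (the class and its fields are illustrative, not the patent's implementation): accepted block aggregates drive the running snapshot estimate, rejected aggregates accumulate in the recycle bin, and at 100% progress the two sums give the exact answer.

```python
# Sketch of the reduce-side bookkeeping of step D (illustrative data structures only).
class ReduceSide:
    def __init__(self, total_blocks):
        self.N = total_blocks
        self.accepted = []       # per-block aggregates exp_p(B_i) of accepted blocks
        self.recycle_bin = 0.0   # running sum of the rejected blocks' aggregates

    def on_block(self, block_sum, accepted):
        if accepted:
            self.accepted.append(block_sum)
        else:
            self.recycle_bin += block_sum

    def snapshot(self):
        """Running estimate of the total: the accepted blocks are treated as a uniform
        random block sample (step C), so the sample mean is scaled up by N."""
        n = len(self.accepted)
        return self.N * sum(self.accepted) / n if n else 0.0

    def final_result(self):
        """Exact aggregate once every block has been processed (sampling progress 100%):
        accepted aggregates plus the recycle-bin aggregate."""
        return sum(self.accepted) + self.recycle_bin
```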
The whole-group sampling (Cluster Sampling) adopted by the whole-group sampler is a sampling scheme in which the population units are divided into a number of mutually exclusive, non-overlapping sets called groups (clusters), and samples are then drawn with these groups as the sampling units. When cluster sampling is applied, each group should be representative: the variation among units within a group should be large, while the variation between groups should be small.
To eliminate the bias caused by data skew and MapReduce distributed processing, the invention guarantees the validity and unbiasedness of the statistical algorithm through acceptance-rejection sampling, so it can handle online aggregate queries over arbitrarily skewed data in a MapReduce environment and can be widely applied in various distributed environments.
Further, in step A, the whole-group sampler maintains data-block random queues for each data table; a data-block random queue contains a number of the data table's blocks, and each random queue corresponds to one mapper. The order of all data blocks within a random queue is randomized. At each scheduling the map end assigns a mapper, and when the mapper requests input data from the upstream data node it iteratively returns the data block at the head of its corresponding random queue.
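A sketch of the whole-group sampler of step A under these assumptions (one randomized queue per mapper; the round-robin split of blocks into queues is an illustrative choice, not prescribed by the patent):

```python
# Sketch of the step-A cluster sampler: one randomized queue of data blocks per mapper,
# so that handing out blocks in queue order yields a random block order.
import random
from collections import deque

def build_block_queues(block_ids, num_mappers, seed=42):
    """Shuffle the table's blocks once, then deal them round-robin into one queue per mapper."""
    rng = random.Random(seed)
    shuffled = block_ids[:]
    rng.shuffle(shuffled)
    queues = [deque() for _ in range(num_mappers)]
    for i, block in enumerate(shuffled):
        queues[i % num_mappers].append(block)
    return queues

def next_block(queue):
    """Called when a mapper requests input: return the block at the head of its queue."""
    return queue.popleft() if queue else None

queues = build_block_queues(list(range(12)), num_mappers=3)
print([next_block(q) for q in queues])   # one randomly ordered block per mapper
```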
Specifically, the aggregation in step B is calculated as follows:
B1. The map function filters the input tuples according to the predicate selected by the WHERE clause of the SQL query statement and projects the necessary attribute columns in the form of key-value pairs, where the key field is the grouping attribute value of tuple t and the value field equals exp_p(t);
B2. The values belonging to the same group within a data block are accumulated by a combine function, whose output value field contains two values: the first value is exp_p(B_i), obtained by local aggregation; the second value is exp_p(B_i)², a preprocessed quantity used to compute the sample variance σ². Computing the sample variance σ² is conventional, following the general formula: variance = mean of the squares minus the square of the mean;
B3. All key-value pairs belonging to the same group are transferred to the same reducer.
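A minimal sketch of steps B1 and B2 (the attribute names, predicate, and expression are placeholders): the map function projects exp_p(t) for qualifying tuples, and the combine function folds a block's values into the pair (exp_p(B_i), exp_p(B_i)²) that the reducer needs for the estimate and the sample variance.

```python
# Sketch of steps B1-B2 over one data block (all names are illustrative).
def map_block(block_tuples, p, exp):
    """B1: filter by the WHERE predicate and project the expression value per tuple."""
    return [(t["group"], exp(t)) for t in block_tuples if p(t)]

def combine_block(mapped):
    """B2: local aggregation per group; emit (exp_p(B_i), exp_p(B_i)^2) so the reducer
    can update both the estimate and the sample variance without seeing raw tuples."""
    sums = {}
    for group, v in mapped:
        sums[group] = sums.get(group, 0.0) + v
    # The second value is the square of the block-level aggregate, computed once per group.
    return {g: (s, s * s) for g, s in sums.items()}

block = [{"group": "a", "x": 2}, {"group": "a", "x": 3}, {"group": "b", "x": -1}]
print(combine_block(map_block(block, p=lambda t: t["x"] > 0, exp=lambda t: t["x"])))
# {'a': (5.0, 25.0)}
```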
Further, in step C, when the map end produces a new output result, the value field of the key-value pair in that output carries the number of valid tuples contained in the corresponding data block, as counted by the combine function; the acceptance-rejection sampler reads this number of valid tuples from the key-value pair, computes the acceptance probability of the corresponding data block, and then generates the random number u.
On this basis, in step C, the acceptance probability of a data block containing no valid tuples is set to
α_i = (b_min / b_max)^β
where b_min denotes the minimum number of valid tuples contained in any data block.
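A sketch of the step-C acceptance-rejection sampler, assuming the acceptance probability takes the form α_i = (b_i / b_max)^β reconstructed above, with the b_min fallback for empty blocks; all numbers are illustrative.

```python
# Minimal sketch of the step-C acceptance-rejection sampler, assuming
# alpha_i = (b_i / b_max) ** beta with beta in [0, 1]; blocks with no valid tuples
# fall back to b_min in place of b_i.
import random

def acceptance_probability(b_i, b_max, b_min, beta):
    b_eff = b_i if b_i > 0 else b_min
    return (b_eff / b_max) ** beta

def accept_block(b_i, b_max, b_min, beta, rng=random):
    """Draw u ~ U(0, 1); accept the map-end output for this block iff u <= alpha_i."""
    u = rng.random()
    return u <= acceptance_probability(b_i, b_max, b_min, beta)

rng = random.Random(7)
b_max, b_min, beta = 5_000, 10, 0.5
for b_i in (0, 50, 500, 5_000):
    accepted = sum(accept_block(b_i, b_max, b_min, beta, rng) for _ in range(10_000))
    print(f"b_i = {b_i:>5}: accepted {accepted / 100:.1f}% of map outputs")
```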
Through two stages of online sampling and a whole-group sampling strategy with the data block as the unit, the two-stage online sampling method based on the MapReduce model guarantees the randomness of the sample without increasing network transmission cost, provides effective statistical estimation at block granularity, and markedly eliminates the bias that data skew introduces into statistical estimation in a distributed processing environment, thereby guaranteeing the unbiasedness and validity of query estimation.
The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.
Drawings
FIG. 1 is a flow chart of a two-stage online sampling method based on a MapReduce model.
Detailed Description
As shown in FIG. 1, the two-stage online sampling method based on the MapReduce model of the invention comprises the following steps:
A. First-stage sampling: when the MapReduce model receives and initializes input data from an upstream data node, a whole-group (cluster) sampler is set before online processing at the map end; each data block forms one group, and sampling is performed with the data block as the sample unit. The whole-group sampler maintains data-block random queues for each data table; a data-block random queue contains a number of the data table's blocks, and each random queue corresponds to one mapper. The order of all data blocks within a random queue is randomized. At each scheduling the map end assigns a mapper, and when the mapper requests input data from the upstream data node it iteratively returns the data block at the head of its corresponding random queue.
B. In the online query stage of the MapReduce model, a variable is constructed with the data block as the observation unit:
Y_i = N · exp_p(B_i)
where N is the total number of data blocks in the data table and exp_p(B_i) is the aggregate over all tuples in data block B_i that satisfy the predicate condition. The estimate of the query result is
μ̂ = (1/n) · Σ_{i=1..n} Y_i
where n is the number of sampled blocks. According to the central limit theorem, the variables Y_i approximately follow a normal distribution with population mean μ and variance σ².
The aggregation is calculated as follows:
B1. The map function filters the input tuples according to the predicate selected by the WHERE clause of the SQL query statement and projects the necessary attribute columns in the form of key-value pairs, where the key field is the grouping attribute value of tuple t and the value field equals exp_p(t);
B2. The values belonging to the same group within a data block are accumulated by a combine function, whose output value field contains two values: the first value is exp_p(B_i), obtained by local aggregation; the second value is exp_p(B_i)², a preprocessed quantity used to compute the sample variance σ²;
B3. All key-value pairs belonging to the same group are transferred to the same reducer.
Then, given a confidence level (1 − α), the confidence-interval half-width is calculated as
ε_n = z_{α/2} · sqrt(s_n² / n)
and the confidence interval of the aggregation result is
[μ̂ − ε_n, μ̂ + ε_n]
where α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, and s_n² is the sample variance.
C. In the MapReduce online query-processing stage, because of data skew, data blocks with small aggregate results are more likely to reach the reduce end first, which biases the estimate low. To eliminate this bias, second-stage sampling is required: before the reduce end starts processing, an acceptance-rejection sampler (A/R Sampling) corrects the probability that each data block is drawn by the reduce end. When the map end produces a new output result, the value field of the key-value pair in that output carries the number of valid tuples contained in the corresponding data block, as counted by the combine function; the acceptance-rejection sampler reads this number from the key-value pair and computes the acceptance probability of the corresponding data block. The acceptance-rejection sampler then generates a random number u uniformly distributed between 0 and 1; if u is less than or equal to the acceptance probability, the map-end output result is accepted and allowed to enter subsequent aggregation processing, otherwise it is discarded. Since the estimation in the aggregation processing can only be based on random samples, part of the map-end output results must be discarded. The map-end output results accepted by the acceptance-rejection sampler thus form a random, uniform block-level sample, where the acceptance probability of data block B_i is
α_i = (b_i / b_max)^β
b_i denotes the number of valid tuples contained in data block B_i, b_max denotes the maximum number of valid tuples contained in any data block, β is called the adjustment factor, and 0 ≤ β ≤ 1.
Different adjustment factors β are set according to the degree of data skew in different scenarios to avoid over-correction. In a severely skewed data set, the number of valid tuples in a few individual data blocks is far larger than the average over all data blocks, so that under acceptance-rejection sampling the acceptance probabilities of most data blocks become very small values; most data blocks are then discarded at the reduce end and the estimate is no longer updated, a phenomenon referred to as "over-correction". To avoid it, the adjustment factor β is used to adjust the acceptance probability α_i: setting β to different values according to the degree of skew in different application scenarios keeps the acceptance probability α_i between 0 and 1 on the one hand, and eliminates the over-correction phenomenon on the other, ensuring that results are output quickly and stably.
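A few illustrative numbers, again assuming the (b_i / b_max)^β form of the acceptance probability, show how lowering β counteracts over-correction on a heavily skewed table:

```python
# Illustration of the adjustment factor beta under heavy skew (synthetic numbers,
# assuming alpha_i = (b_i / b_max) ** beta): lowering beta lifts the acceptance
# probabilities of ordinary blocks so the reduce end is not starved of samples.
b_max = 100_000          # one extreme block dominates the skewed table
typical_b = 200          # most blocks contain only a few hundred valid tuples
for beta in (1.0, 0.5, 0.25, 0.1):
    alpha = (typical_b / b_max) ** beta
    print(f"beta = {beta:<4}: typical block accepted with probability {alpha:.3f}")
# beta = 1.0 : 0.002  -> almost everything discarded ("over-correction")
# beta = 0.5 : 0.045
# beta = 0.25: 0.211
# beta = 0.1 : 0.537  -> estimates keep updating at a steady rate
```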
In addition, for data blocks containing no valid tuples, the acceptance probability is set to
α_i = (b_min / b_max)^β
where b_min denotes the minimum number of valid tuples contained in any data block.
D. A recycle bin is set at the reduce end; the map-end output results discarded by the acceptance-rejection sampler enter the recycle bin for aggregation. After all map-end output results in the recycle bin have been aggregated, the aggregation result of the recycle bin is added to the snapshot result (Snapshot) obtained from the aggregation of the accepted map-end output results in step C, yielding the actual result of the aggregate query.
By adopting acceptance-rejection sampling within a two-stage online sampling scheme, the invention guarantees the validity and unbiasedness of the statistical algorithm, so the method can handle online aggregate queries over arbitrarily skewed data in a MapReduce environment and can be widely applied in various distributed environments.

Claims (5)

1. The two-stage online sampling method based on the MapReduce model is characterized by comprising the following steps of:
A. First-stage sampling: when the MapReduce model receives and initializes input data from an upstream data node, a whole-group (cluster) sampler is set before online processing at the map end; each data block forms one group, and sampling is performed with the data block as the sample unit;
B. In the online query stage of the MapReduce model, a variable Y_i = N · exp_p(B_i) is constructed with the data block as the observation unit, where N is the total number of data blocks in the data table and exp_p(B_i) is the aggregate over all tuples in data block B_i that satisfy the predicate condition; the estimate of the query result is
μ̂ = (1/n) · Σ_{i=1..n} Y_i
where n is the number of sampled blocks; according to the central limit theorem, the variables Y_i approximately follow a normal distribution with population mean μ and variance σ²; given a confidence level (1 − α), the confidence-interval half-width is calculated as
ε_n = z_{α/2} · sqrt(s_n² / n)
and the confidence interval of the aggregation result is
[μ̂ − ε_n, μ̂ + ε_n]
where α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, and s_n² is the sample variance;
C. Second-stage sampling: before the reduce end starts processing, an acceptance-rejection sampler corrects the probability that each data block is drawn by the reduce end: when the map end produces a new output result, the acceptance-rejection sampler generates a random number u uniformly distributed between 0 and 1; if u is less than or equal to the acceptance probability, the map-end output result is accepted and allowed to enter subsequent aggregation processing, otherwise it is discarded; the map-end output results accepted by the acceptance-rejection sampler form a random, uniform block-level sample, where the acceptance probability of data block B_i is
α_i = (b_i / b_max)^β
where b_i denotes the number of valid tuples contained in data block B_i, b_max denotes the maximum number of valid tuples contained in any data block, and β is called the adjustment factor, with 0 ≤ β ≤ 1;
D. A recycle bin is set at the reduce end; the map-end output results discarded by the acceptance-rejection sampler enter the recycle bin for aggregation; after all map-end output results in the recycle bin have been aggregated, the aggregation result of the recycle bin is added to the snapshot result obtained from the aggregation of the accepted map-end output results in step C, yielding the actual result of the aggregate query.
2. The MapReduce model-based two-stage online sampling method of claim 1, wherein: in step A, the whole-group sampler maintains data-block random queues for each data table; a data-block random queue contains a number of the data table's blocks, and each random queue corresponds to one mapper; the order of all data blocks within a random queue is randomized; at each scheduling the map end assigns a mapper, and when the mapper requests input data from the upstream data node it iteratively returns the data block at the head of its corresponding random queue.
3. The MapReduce model-based two-stage online sampling method of claim 1, wherein the aggregation in step B is calculated as follows:
B1. The map function filters the input tuples according to the predicate selected by the WHERE clause of the SQL query statement and projects the necessary attribute columns in the form of key-value pairs, where the key field is the grouping attribute value of tuple t and the value field equals exp_p(t);
B2. The values belonging to the same group within a data block are accumulated by a combine function, whose output value field contains two values: the first value is exp_p(B_i), obtained by local aggregation; the second value is exp_p(B_i)², a preprocessed quantity used to compute the sample variance σ²;
B3. All key-value pairs belonging to the same group are transferred to the same reducer.
4. The MapReduce model-based two-stage online sampling method of claim 1, wherein: in step C, when the map end produces a new output result, the value field of the key-value pair in that output carries the number of valid tuples contained in the corresponding data block, as counted by the combine function; the acceptance-rejection sampler reads this number of valid tuples from the key-value pair, computes the acceptance probability of the corresponding data block, and then generates the random number u.
5. The MapReduce model-based two-stage online sampling method of any one of claims 1 to 4, wherein: in step C, the acceptance probability of a data block containing no valid tuples is set to
α_i = (b_min / b_max)^β
where b_min denotes the minimum number of valid tuples contained in any data block.
CN201911267526.5A 2019-12-11 2019-12-11 Two-stage online sampling method based on MapReduce model Active CN110704515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911267526.5A CN110704515B (en) 2019-12-11 2019-12-11 Two-stage online sampling method based on MapReduce model


Publications (2)

Publication Number Publication Date
CN110704515A CN110704515A (en) 2020-01-17
CN110704515B true CN110704515B (en) 2020-06-02

Family

ID=69208090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911267526.5A Active CN110704515B (en) 2019-12-11 2019-12-11 Two-stage online sampling method based on MapReduce model

Country Status (1)

Country Link
CN (1) CN110704515B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205359B (en) * 2021-04-27 2024-04-05 Kingdee Software (China) Co., Ltd. Method and device for determining commodity price in bill and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799486A (en) * 2012-06-18 2012-11-28 北京大学 Data sampling and partitioning method for MapReduce system
CN104679590A (en) * 2013-11-27 2015-06-03 阿里巴巴集团控股有限公司 Map optimization method and device in distributive calculating system
CN106202172A (en) * 2016-06-24 2016-12-07 中国农业银行股份有限公司 Text compression methods and device
CN107066328A (en) * 2017-05-19 2017-08-18 成都四象联创科技有限公司 The construction method of large-scale data processing platform

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216683A (en) * 2013-05-31 2014-12-17 国际商业机器公司 Method and system for data processing through simultaneous multithreading (SMT)
CN103699696B (en) * 2014-01-13 2017-01-18 中国人民大学 Data online gathering method in cloud computing environment
CN104503844B (en) * 2014-12-29 2018-03-09 中国科学院深圳先进技术研究院 A kind of MapReduce operation fine grit classification methods based on multistage feature
CN106874367A (en) * 2016-12-30 2017-06-20 江苏号百信息服务有限公司 A kind of sampling distribution formula clustering method based on public sentiment platform


Also Published As

Publication number Publication date
CN110704515A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
US9152691B2 (en) System and method for performing set operations with defined sketch accuracy distribution
WO2016101628A1 (en) Data processing method and device in data modeling
Yun et al. Fastraq: A fast approach to range-aggregate queries in big data environments
CN106708989A (en) Spatial time sequence data stream application-based Skyline query method
EP3717997A1 (en) Cardinality estimation in databases
CN110704515B (en) Two-stage online sampling method based on MapReduce model
CN108733688B (en) Data analysis method and device
Kulessa et al. Model-based approximate query processing
Tran et al. Conditioning and aggregating uncertain data streams: Going beyond expectations
WO2022095661A1 (en) Update method and apparatus for recommendation model, computer device, and storage medium
CN107656995A (en) Towards the data management system of big data
WO2019184325A1 (en) Community division quality evaluation method and system based on average mutual information
CN109062949A (en) A kind of method of multi-table join search efficiency in raising Online aggregate
CN110597857B (en) Online aggregation method based on shared sample
CN111629216A (en) VOD service cache replacement method based on random forest algorithm under edge network environment
CN113641654B (en) Marketing treatment rule engine method based on real-time event
CN108256028B (en) Multi-dimensional dynamic sampling method for approximate query in cloud computing environment
Sheoran et al. A Step Toward Deep Online Aggregation (Extended Version)
WO2019153543A1 (en) Data dimension generation method, apparatus, device, and computer readable storage medium
CN103560921A (en) Method for merging network streaming data
CN112650770B (en) MySQL parameter recommendation method based on query work load analysis
Yang et al. A review of uncertain data stream clustering algorithms
CN110909019B (en) Big data duplicate checking method and device, computer equipment and storage medium
CN115827930B (en) Data query optimization method, system and device for graph database
WO2024041221A1 (en) Selection rate estimation method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant