CN110704515B - Two-stage online sampling method based on MapReduce model


Info

Publication number
CN110704515B
Authority
CN
China
Prior art keywords
data
data block
result
value
stage
Prior art date
Legal status
Active
Application number
CN201911267526.5A
Other languages
Chinese (zh)
Other versions
CN110704515A (en)
Inventor
谭皓予
Current Assignee
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd
Priority to CN201911267526.5A
Publication of CN110704515A
Application granted
Publication of CN110704515B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a two-stage online sampling method based on the MapReduce model, comprising the following steps. Step 1: first-stage sampling: a whole-group (cluster) sampler is set before online processing at the map end of the MapReduce model, and sampling is performed with the data block as the sample unit. Step 2: in the query stage of the MapReduce model, an estimate of the query result is obtained and the confidence-interval width is calculated for a given confidence level. Step 3: second-stage sampling: before the reduce end starts processing, the probability of each data block being sampled at the reduce end is corrected by an acceptance-rejection sampler. Step 4: the discarded map-end output results are aggregated in a recycle bin at the reduce end and added to the snapshot result obtained from the accepted data blocks to obtain the actual result of the aggregate query. The invention guarantees the randomness of the sample without increasing network transmission cost, provides effective statistical estimates, eliminates the bias that data skew introduces into the statistical estimates, and thereby guarantees the unbiasedness and validity of the query estimates.

Description

Two-stage online sampling method based on MapReduce model
Technical Field
The invention relates to a method for online sampling of data, in particular to a two-stage online sampling method based on a MapReduce model.
Background
With the development of information digitization, the global data volume has grown explosively, and data mining and data analysis based on big data have become hot topics across many fields. On-Line Aggregation (OLA) technology quickly returns an approximate result computed from sample data, meeting the demands of real-time processing and fast user interaction. Compared with offline batch processing, online aggregation can return an estimated result together with a confidence interval at a given confidence level in a much shorter time, and it keeps returning approximate results during processing, with the estimation quality improving continuously as more data are processed. This online mode of returning results quickly, without waiting for all data to be processed, lets the user follow the progress of the query in time; when the user is satisfied with the accuracy of the estimated result, the query can be terminated, saving processing time and computational resources.
Online aggregation originated in the 1990s in the field of relational databases; it was initially aimed at aggregation operations over a single table and was later extended to aggregate queries over multi-table joins. Research on online aggregation in the MapReduce environment (a distributed computing model for large-scale data processing) has made some preliminary progress: for example, Condie et al. proposed the Hadoop Online Prototype (HOP), which provides an implementation platform for online aggregation in a MapReduce environment, and Pansare et al. proposed an online aggregation method based on the inspection paradox and Bayesian theory.
The MapReduce model is used for processing and generating large-scale data sets and adopts a functional programming idea. It contains two stages, map and reduce, so a user writing a MapReduce program generally implements two functions: a map function and a reduce function. The map function accepts a key-value pair as input and produces a set of intermediate key-value pairs as output. The MapReduce framework groups the intermediate key-value pairs produced by the map function by key and passes all values sharing the same key to the same reduce function.
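As a concrete illustration of this programming model (a minimal, self-contained sketch; the predicate p, the expression exp, and the in-memory shuffle are stand-ins, not the patent's code), the map and reduce functions for a filtered SUM query could look as follows:

```python
# Minimal sketch of the MapReduce programming idea for SELECT SUM(exp(t)) FROM T WHERE p(t).
# All names (predicate p, expression exp, the in-memory "framework") are illustrative only.
from collections import defaultdict

def map_fn(row):
    """Emit an intermediate key-value pair for each tuple that passes the predicate."""
    if p(row):                         # WHERE-clause predicate
        yield ("sum", exp(row))        # the key groups values for the reducer

def reduce_fn(key, values):
    """Aggregate all values that share the same key."""
    return key, sum(values)

def p(row):   return row["amount"] > 0     # example predicate
def exp(row): return row["amount"]         # example expression over attributes

def run(table):
    """Tiny in-memory stand-in for the framework: shuffle by key, then reduce."""
    groups = defaultdict(list)
    for row in table:
        for k, v in map_fn(row):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run([{"amount": 3}, {"amount": -1}, {"amount": 5}]))   # {'sum': 8}
```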
In existing online aggregation research, estimation algorithms based on the central limit theorem are widely used; while estimating the aggregation result they also provide a confidence level and a confidence-interval width as accuracy measures. After a query statement is given, such a method performs statistical modeling according to the type of aggregation operation and constructs a random variable X whose expectation is the final aggregation result, thereby converting estimation of the result into estimation of a population mean. Using the central limit theorem, the population mean and variance are estimated from random samples, completing the estimation of the aggregation result.
For an aggregate query statement SELECT aggr(exp(t)) FROM T, where aggr denotes the type of aggregation operation and exp is an algebraic expression over the attributes of table T, a statistical variable is constructed as
X_i = exp(t_i), if tuple t_i satisfies the filtering predicate p
X_i = 0, otherwise
where p is the filtering predicate; the population mean μ of X then determines the aggregate query result, and the sample mean is scaled by the ratio of the total data volume to the sample size to obtain the aggregation result to be computed. Given a confidence level (1 − α), a confidence interval for the true result can be computed for each estimate from the central limit theorem and expressed as
[μ̂ − ε_n, μ̂ + ε_n], with half-width ε_n = z_{α/2} · sqrt(s_n² / n)
which gives both an estimate μ̂ of the query result of the aggregate query statement under the MapReduce model and the width of its confidence interval. Here α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, the sample variance s_n² is used in place of the population variance σ², and n denotes the sample size. The user can judge the accuracy of the estimate from the confidence level and the confidence-interval width; the interval narrows gradually as the query proceeds, and the user can decide from its width whether to terminate the query early.
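A minimal sketch of this CLT-based estimation, assuming a SUM-type aggregate scaled up from a uniform random sample of tuples (the data and the 95% confidence level are illustrative choices, not part of the patent):

```python
# Sketch of a CLT-based running estimate with a confidence interval for a SUM-type
# aggregate, scaled up from a uniform random sample (illustrative only).
import math, random

def estimate_with_ci(sample_x, N):
    """sample_x: values X_i = exp(t_i) if the predicate holds, else 0, for n sampled tuples.
    Returns (estimate of the SUM over all N tuples, half-width of the 95% CI)."""
    n = len(sample_x)
    mean = sum(sample_x) / n
    var = sum((x - mean) ** 2 for x in sample_x) / (n - 1)   # sample variance s_n^2
    z = 1.959963985                                          # z_{alpha/2} for a 95% level
    eps = z * math.sqrt(var / n)                             # half-width for the mean
    return N * mean, N * eps                                 # scale mean and width to the SUM

random.seed(0)
population = [random.random() for _ in range(100_000)]
sample = random.sample(population, 2_000)
est, eps = estimate_with_ci(sample, len(population))
print(f"estimate = {est:.1f} +/- {eps:.1f}, true sum = {sum(population):.1f}")
```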
The online aggregation method based on the central limit theorem places high demands on the sampling algorithm: sample quality directly affects the accuracy of the estimate and the convergence speed of the confidence interval. An online aggregation system generally requires the sample data to be random, and online aggregation methods in the traditional relational-database field sample at tuple granularity. In related work on online aggregation, there are three main ways to sample from a relational data table: sequential scanning, index scanning, and index sampling. Sai Wu et al., studying online aggregation in a distributed environment, propose a method for sampling from a distributed data table: first compute the amount of data to be sampled on each node according to the distribution of the table across the nodes, then perform index sampling on each node in that proportion.
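The proportional-allocation step described above can be sketched as follows (node names and data volumes are made up for illustration):

```python
# Proportional allocation of a sampling budget across nodes, as in the distributed
# index-sampling approach described above (node sizes are illustrative).
def allocate(node_sizes, total_sample):
    """Split total_sample across nodes in proportion to the data volume on each node."""
    total = sum(node_sizes.values())
    alloc = {node: (size * total_sample) // total for node, size in node_sizes.items()}
    # Hand any rounding remainder to the largest node so the budget is met exactly.
    remainder = total_sample - sum(alloc.values())
    alloc[max(node_sizes, key=node_sizes.get)] += remainder
    return alloc

print(allocate({"node1": 4_000_000, "node2": 1_000_000, "node3": 5_000_000}, 1_000))
# {'node1': 400, 'node2': 100, 'node3': 500}
```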
Since data in a MapReduce environment is stored and processed in units of blocks, block-level random sampling is generally used in such environments for the sake of sampling efficiency. However, block-granularity random sampling in a distributed environment such as MapReduce cannot guarantee that the samples are independent and identically distributed. Although the input data blocks of each mapper are almost equal in size, data skew can still occur at the map end: for example, the proportion of valid data satisfying the selection predicate differs from block to block, so the intermediate result sets produced by the mappers differ in size and the mappers take different amounts of time. Moreover, there is some correlation between a mapper's processing time and the magnitude of the block's aggregate value: blocks containing less valid data tend to finish the map stage sooner and arrive at the reduce end earlier, and such blocks are more likely to contribute a small aggregate value. Observing the sample set at any moment during query processing, blocks with smaller aggregate values therefore appear with higher probability; the samples cannot be regarded as independent and identically distributed random variables, which undermines the unbiasedness of the estimation algorithm and the accuracy of the estimated result.
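This arrival-order bias can be illustrated with a small, entirely synthetic simulation in which processing time grows with a block's aggregate value, so the earliest snapshot is dominated by small blocks:

```python
# Synthetic illustration of the arrival-order bias: blocks with fewer valid tuples
# finish earlier, so an early snapshot over-represents blocks with small aggregates.
import random, statistics

random.seed(1)
# Each block's aggregate value; skewed so a few blocks hold most of the mass.
block_sums = [random.paretovariate(1.5) * 10 for _ in range(1_000)]
true_mean = statistics.mean(block_sums)

# Assume processing time grows with the block's aggregate: small blocks arrive first.
arrival_order = sorted(block_sums)            # earliest arrivals = smallest aggregates
early_snapshot = arrival_order[:100]          # first 10% of blocks to reach the reducer
print(f"true mean per block      : {true_mean:.1f}")
print(f"mean over early arrivals : {statistics.mean(early_snapshot):.1f}  (biased low)")
```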
Disclosure of Invention
The invention provides a two-stage online sampling method based on the MapReduce model, which eliminates the bias that data skew introduces into statistical estimation in a distributed processing environment, thereby guaranteeing the unbiasedness and validity of query estimation.
The invention relates to a two-stage online sampling method based on a MapReduce model, which comprises the following steps:
A. First-stage sampling: when the MapReduce model receives and initializes input data from an upstream data node, a whole-group (cluster) sampler is set before online processing at the map end; each data block forms one group, and sampling is performed with the data block as the sample unit;
B. In the online query stage of the MapReduce model, a variable is constructed with the data block as the observation unit:
Y_i = N · exp_p(B_i)
where N is the total number of data blocks in the data table and exp_p(B_i) is the aggregate over all tuples in data block B_i that satisfy the predicate condition. The estimate of the query result is
μ̂ = (1/n) · Σ_{i=1..n} Y_i
where n is the number of sampled blocks. According to the central limit theorem, the variables Y_i approximately follow a normal distribution with population mean μ and variance σ². Given a confidence level (1 − α), the confidence-interval half-width is calculated as
ε_n = z_{α/2} · sqrt(s_n² / n)
and the confidence interval of the aggregation result is
[μ̂ − ε_n, μ̂ + ε_n]
where α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, and s_n² is the sample variance;
C. Second-stage sampling: before the reduce end starts processing, an acceptance-rejection sampler corrects the probability that each data block is drawn by the reduce end. Whenever the map end produces a new output result, the acceptance-rejection sampler (A/R sampling) generates a random number u uniformly distributed between 0 and 1; if u is less than or equal to the acceptance probability, the map-end output result is accepted and allowed to enter subsequent aggregation processing, otherwise it is discarded. The map-end output results accepted by the acceptance-rejection sampler form a random, uniform block-level sample, where the acceptance probability of data block B_i is
α_i = (b_i / b_max)^β
b_i denotes the number of valid tuples contained in data block B_i, b_max denotes the maximum number of valid tuples contained in any data block, and β is called the adjustment factor, with
0 ≤ β ≤ 1;
D. A recycle bin is set at the reduce end; the map-end output results discarded by the acceptance-rejection sampler enter the recycle bin for aggregation. After all map-end output results in the recycle bin have been aggregated, the aggregation result of the recycle bin is added to the snapshot result (Snapshot) obtained from the aggregation of the accepted map-end output results in step C, yielding the actual result of the aggregate query (a compact sketch of this bookkeeping follows below). The core idea of the aggregation processing is that, as the proportion of sampled data grows, the aggregate value is estimated after each sample, and each estimate is a snapshot of the final aggregation result. The larger the proportion of sampled data relative to the total data, the more accurate the estimated aggregation result and the more the estimation interval converges. While the sampling progress has not yet reached 100%, the aggregation processing produces estimates of the aggregation result; and because these estimates can only be based on random samples, a part of the map-end output results must be discarded in step C. When the sampling progress reaches 100%, all data have been aggregated and the final exact aggregation result can be computed.
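A minimal sketch of the reduce-side bookkeeping described in step D (the class and its fields are illustrative, not the patent's implementation): accepted block aggregates drive the running snapshot estimate, rejected aggregates accumulate in the recycle bin, and at 100% progress the two sums give the exact answer.

```python
# Sketch of the reduce-side bookkeeping of step D (illustrative data structures only).
class ReduceSide:
    def __init__(self, total_blocks):
        self.N = total_blocks
        self.accepted = []       # per-block aggregates exp_p(B_i) of accepted blocks
        self.recycle_bin = 0.0   # running sum of the rejected blocks' aggregates

    def on_block(self, block_sum, accepted):
        if accepted:
            self.accepted.append(block_sum)
        else:
            self.recycle_bin += block_sum

    def snapshot(self):
        """Running estimate of the total: the accepted blocks are treated as a uniform
        random block sample (step C), so the sample mean is scaled up by N."""
        n = len(self.accepted)
        return self.N * sum(self.accepted) / n if n else 0.0

    def final_result(self):
        """Exact aggregate once every block has been processed (sampling progress 100%):
        accepted aggregates plus the recycle-bin aggregate."""
        return sum(self.accepted) + self.recycle_bin
```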
The whole-group sampling (Cluster Sampling) adopted by the whole-group sampler is a sampling scheme in which the population units are divided into a number of mutually exclusive, non-overlapping sets called groups (clusters), and samples are then drawn with these groups as the sampling units. When cluster sampling is applied, each group should be representative: the variation among units within a group should be large, while the variation between groups should be small.
To eliminate the bias caused by data skew and MapReduce distributed processing, the invention guarantees the validity and unbiasedness of the statistical algorithm through acceptance-rejection sampling, so it can handle online aggregate queries over arbitrarily skewed data in a MapReduce environment and can be widely applied in various distributed environments.
Further, in step A, the whole-group sampler maintains data-block random queues for each data table; a data-block random queue contains a number of the data table's blocks, and each random queue corresponds to one mapper. The order of all data blocks within a random queue is randomized. At each scheduling the map end assigns a mapper, and when the mapper requests input data from the upstream data node it iteratively returns the data block at the head of its corresponding random queue.
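A sketch of the whole-group sampler of step A under these assumptions (one randomized queue per mapper; the round-robin split of blocks into queues is an illustrative choice, not prescribed by the patent):

```python
# Sketch of the step-A cluster sampler: one randomized queue of data blocks per mapper,
# so that handing out blocks in queue order yields a random block order.
import random
from collections import deque

def build_block_queues(block_ids, num_mappers, seed=42):
    """Shuffle the table's blocks once, then deal them round-robin into one queue per mapper."""
    rng = random.Random(seed)
    shuffled = block_ids[:]
    rng.shuffle(shuffled)
    queues = [deque() for _ in range(num_mappers)]
    for i, block in enumerate(shuffled):
        queues[i % num_mappers].append(block)
    return queues

def next_block(queue):
    """Called when a mapper requests input: return the block at the head of its queue."""
    return queue.popleft() if queue else None

queues = build_block_queues(list(range(12)), num_mappers=3)
print([next_block(q) for q in queues])   # one randomly ordered block per mapper
```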
Specifically, the aggregation in step B is calculated as follows:
B1. The map function filters the input tuples according to the predicate selected by the WHERE clause of the SQL query statement and projects the necessary attribute columns in the form of key-value pairs, where the key field is the grouping attribute value of tuple t and the value field equals exp_p(t);
B2. The values belonging to the same group within a data block are accumulated by a combine function, whose output value field contains two values: the first value is exp_p(B_i), obtained by local aggregation; the second value is exp_p(B_i)², a preprocessed quantity used to compute the sample variance σ². Computing the sample variance σ² is conventional, following the general formula: variance = mean of the squares minus the square of the mean;
B3. All key-value pairs belonging to the same group are transferred to the same reducer.
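A minimal sketch of steps B1 and B2 (the attribute names, predicate, and expression are placeholders): the map function projects exp_p(t) for qualifying tuples, and the combine function folds a block's values into the pair (exp_p(B_i), exp_p(B_i)²) that the reducer needs for the estimate and the sample variance.

```python
# Sketch of steps B1-B2 over one data block (all names are illustrative).
def map_block(block_tuples, p, exp):
    """B1: filter by the WHERE predicate and project the expression value per tuple."""
    return [(t["group"], exp(t)) for t in block_tuples if p(t)]

def combine_block(mapped):
    """B2: local aggregation per group; emit (exp_p(B_i), exp_p(B_i)^2) so the reducer
    can update both the estimate and the sample variance without seeing raw tuples."""
    sums = {}
    for group, v in mapped:
        sums[group] = sums.get(group, 0.0) + v
    # The second value is the square of the block-level aggregate, computed once per group.
    return {g: (s, s * s) for g, s in sums.items()}

block = [{"group": "a", "x": 2}, {"group": "a", "x": 3}, {"group": "b", "x": -1}]
print(combine_block(map_block(block, p=lambda t: t["x"] > 0, exp=lambda t: t["x"])))
# {'a': (5.0, 25.0)}
```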
Further, in step C, when the map end produces a new output result, the value field of the key-value pair in that output carries the number of valid tuples contained in the corresponding data block, as counted by the combine function; the acceptance-rejection sampler reads this number of valid tuples from the key-value pair, computes the acceptance probability of the corresponding data block, and then generates the random number u.
On this basis, in step C, the acceptance probability of a data block containing no valid tuples is set to
α_i = (b_min / b_max)^β
where b_min denotes the minimum number of valid tuples contained in any data block.
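A sketch of the step-C acceptance-rejection sampler, assuming the acceptance probability takes the form α_i = (b_i / b_max)^β reconstructed above, with the b_min fallback for empty blocks; all numbers are illustrative.

```python
# Minimal sketch of the step-C acceptance-rejection sampler, assuming
# alpha_i = (b_i / b_max) ** beta with beta in [0, 1]; blocks with no valid tuples
# fall back to b_min in place of b_i.
import random

def acceptance_probability(b_i, b_max, b_min, beta):
    b_eff = b_i if b_i > 0 else b_min
    return (b_eff / b_max) ** beta

def accept_block(b_i, b_max, b_min, beta, rng=random):
    """Draw u ~ U(0, 1); accept the map-end output for this block iff u <= alpha_i."""
    u = rng.random()
    return u <= acceptance_probability(b_i, b_max, b_min, beta)

rng = random.Random(7)
b_max, b_min, beta = 5_000, 10, 0.5
for b_i in (0, 50, 500, 5_000):
    accepted = sum(accept_block(b_i, b_max, b_min, beta, rng) for _ in range(10_000))
    print(f"b_i = {b_i:>5}: accepted {accepted / 100:.1f}% of map outputs")
```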
Through two stages of online sampling and a whole-group sampling strategy with the data block as the unit, the two-stage online sampling method based on the MapReduce model guarantees the randomness of the sample without increasing network transmission cost, provides effective statistical estimation at block granularity, and markedly eliminates the bias that data skew introduces into statistical estimation in a distributed processing environment, thereby guaranteeing the unbiasedness and validity of query estimation.
The present invention will be described in further detail with reference to the following examples. This should not be understood as limiting the scope of the above-described subject matter of the present invention to the following examples. Various substitutions and alterations according to the general knowledge and conventional practice in the art are intended to be included within the scope of the present invention without departing from the technical spirit of the present invention as described above.
Drawings
FIG. 1 is a flow chart of a two-stage online sampling method based on a MapReduce model.
Detailed Description
As shown in FIG. 1, the two-stage online sampling method based on the MapReduce model of the invention comprises the following steps:
A. First-stage sampling: when the MapReduce model receives and initializes input data from an upstream data node, a whole-group (cluster) sampler is set before online processing at the map end; each data block forms one group, and sampling is performed with the data block as the sample unit. The whole-group sampler maintains data-block random queues for each data table; a data-block random queue contains a number of the data table's blocks, and each random queue corresponds to one mapper. The order of all data blocks within a random queue is randomized. At each scheduling the map end assigns a mapper, and when the mapper requests input data from the upstream data node it iteratively returns the data block at the head of its corresponding random queue.
B. In the online query stage of the MapReduce model, a variable is constructed with the data block as the observation unit:
Y_i = N · exp_p(B_i)
where N is the total number of data blocks in the data table and exp_p(B_i) is the aggregate over all tuples in data block B_i that satisfy the predicate condition. The estimate of the query result is
μ̂ = (1/n) · Σ_{i=1..n} Y_i
where n is the number of sampled blocks. According to the central limit theorem, the variables Y_i approximately follow a normal distribution with population mean μ and variance σ².
The aggregation is calculated as follows:
B1. The map function filters the input tuples according to the predicate selected by the WHERE clause of the SQL query statement and projects the necessary attribute columns in the form of key-value pairs, where the key field is the grouping attribute value of tuple t and the value field equals exp_p(t);
B2. The values belonging to the same group within a data block are accumulated by a combine function, whose output value field contains two values: the first value is exp_p(B_i), obtained by local aggregation; the second value is exp_p(B_i)², a preprocessed quantity used to compute the sample variance σ²;
B3. All key-value pairs belonging to the same group are transferred to the same reducer.
Then, given a confidence level (1 − α), the confidence-interval half-width is calculated as
ε_n = z_{α/2} · sqrt(s_n² / n)
and the confidence interval of the aggregation result is
[μ̂ − ε_n, μ̂ + ε_n]
where α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, and s_n² is the sample variance.
C. In the MapReduce online query-processing stage, because of data skew, data blocks with small aggregate results are more likely to reach the reduce end first, which biases the estimate low. To eliminate this bias, second-stage sampling is required: before the reduce end starts processing, an acceptance-rejection sampler (A/R Sampling) corrects the probability that each data block is drawn by the reduce end. When the map end produces a new output result, the value field of the key-value pair in that output carries the number of valid tuples contained in the corresponding data block, as counted by the combine function; the acceptance-rejection sampler reads this number from the key-value pair and computes the acceptance probability of the corresponding data block. The acceptance-rejection sampler then generates a random number u uniformly distributed between 0 and 1; if u is less than or equal to the acceptance probability, the map-end output result is accepted and allowed to enter subsequent aggregation processing, otherwise it is discarded. Since the estimation in the aggregation processing can only be based on random samples, part of the map-end output results must be discarded. The map-end output results accepted by the acceptance-rejection sampler thus form a random, uniform block-level sample, where the acceptance probability of data block B_i is
α_i = (b_i / b_max)^β
b_i denotes the number of valid tuples contained in data block B_i, b_max denotes the maximum number of valid tuples contained in any data block, β is called the adjustment factor, and 0 ≤ β ≤ 1.
Different adjustment factors β are set according to the degree of data skew in different scenarios to avoid over-correction. In a severely skewed data set, the number of valid tuples in a few individual data blocks is far larger than the average over all data blocks, so that under acceptance-rejection sampling the acceptance probabilities of most data blocks become very small values; most data blocks are then discarded at the reduce end and the estimate is no longer updated, a phenomenon referred to as "over-correction". To avoid it, the adjustment factor β is used to adjust the acceptance probability α_i: setting β to different values according to the degree of skew in different application scenarios keeps the acceptance probability α_i between 0 and 1 on the one hand, and eliminates the over-correction phenomenon on the other, ensuring that results are output quickly and stably.
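A few illustrative numbers, again assuming the (b_i / b_max)^β form of the acceptance probability, show how lowering β counteracts over-correction on a heavily skewed table:

```python
# Illustration of the adjustment factor beta under heavy skew (synthetic numbers,
# assuming alpha_i = (b_i / b_max) ** beta): lowering beta lifts the acceptance
# probabilities of ordinary blocks so the reduce end is not starved of samples.
b_max = 100_000          # one extreme block dominates the skewed table
typical_b = 200          # most blocks contain only a few hundred valid tuples
for beta in (1.0, 0.5, 0.25, 0.1):
    alpha = (typical_b / b_max) ** beta
    print(f"beta = {beta:<4}: typical block accepted with probability {alpha:.3f}")
# beta = 1.0 : 0.002  -> almost everything discarded ("over-correction")
# beta = 0.5 : 0.045
# beta = 0.25: 0.211
# beta = 0.1 : 0.537  -> estimates keep updating at a steady rate
```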
In addition, for data blocks containing no valid tuples, the acceptance probability is set to
α_i = (b_min / b_max)^β
where b_min denotes the minimum number of valid tuples contained in any data block.
D. A recycle bin is set at the reduce end; the map-end output results discarded by the acceptance-rejection sampler enter the recycle bin for aggregation. After all map-end output results in the recycle bin have been aggregated, the aggregation result of the recycle bin is added to the snapshot result (Snapshot) obtained from the aggregation of the accepted map-end output results in step C, yielding the actual result of the aggregate query.
By adopting acceptance-rejection sampling within a two-stage online sampling scheme, the invention guarantees the validity and unbiasedness of the statistical algorithm, so the method can handle online aggregate queries over arbitrarily skewed data in a MapReduce environment and can be widely applied in various distributed environments.

Claims (5)

1. The two-stage online sampling method based on the MapReduce model is characterized by comprising the following steps of:
A. First-stage sampling: when the MapReduce model receives and initializes input data from an upstream data node, a whole-group (cluster) sampler is set before online processing at the map end; each data block forms one group, and sampling is performed with the data block as the sample unit;
B. In the online query stage of the MapReduce model, a variable Y_i = N · exp_p(B_i) is constructed with the data block as the observation unit, where N is the total number of data blocks in the data table and exp_p(B_i) is the aggregate over all tuples in data block B_i that satisfy the predicate condition; the estimate of the query result is
μ̂ = (1/n) · Σ_{i=1..n} Y_i
where n is the number of sampled blocks; according to the central limit theorem, the variables Y_i approximately follow a normal distribution with population mean μ and variance σ²; given a confidence level (1 − α), the confidence-interval half-width is calculated as
ε_n = z_{α/2} · sqrt(s_n² / n)
and the confidence interval of the aggregation result is
[μ̂ − ε_n, μ̂ + ε_n]
where α denotes the significance level, z_{α/2} denotes the α/2 quantile of the standard normal distribution, and s_n² is the sample variance;
C. Second-stage sampling: before the reduce end starts processing, an acceptance-rejection sampler corrects the probability that each data block is drawn by the reduce end: when the map end produces a new output result, the acceptance-rejection sampler generates a random number u uniformly distributed between 0 and 1; if u is less than or equal to the acceptance probability, the map-end output result is accepted and allowed to enter subsequent aggregation processing, otherwise it is discarded; the map-end output results accepted by the acceptance-rejection sampler form a random, uniform block-level sample, where the acceptance probability of data block B_i is
α_i = (b_i / b_max)^β
where b_i denotes the number of valid tuples contained in data block B_i, b_max denotes the maximum number of valid tuples contained in any data block, and β is called the adjustment factor, with 0 ≤ β ≤ 1;
D. A recycle bin is set at the reduce end; the map-end output results discarded by the acceptance-rejection sampler enter the recycle bin for aggregation; after all map-end output results in the recycle bin have been aggregated, the aggregation result of the recycle bin is added to the snapshot result obtained from the aggregation of the accepted map-end output results in step C, yielding the actual result of the aggregate query.
2. The MapReduce model-based two-stage online sampling method of claim 1, wherein: in step A, the whole-group sampler maintains data-block random queues for each data table; a data-block random queue contains a number of the data table's blocks, and each random queue corresponds to one mapper; the order of all data blocks within a random queue is randomized; at each scheduling the map end assigns a mapper, and when the mapper requests input data from the upstream data node it iteratively returns the data block at the head of its corresponding random queue.
3. The MapReduce model-based two-stage online sampling method of claim 1, wherein the aggregation in step B is calculated as follows:
B1. The map function filters the input tuples according to the predicate selected by the WHERE clause of the SQL query statement and projects the necessary attribute columns in the form of key-value pairs, where the key field is the grouping attribute value of tuple t and the value field equals exp_p(t);
B2. The values belonging to the same group within a data block are accumulated by a combine function, whose output value field contains two values: the first value is exp_p(B_i), obtained by local aggregation; the second value is exp_p(B_i)², a preprocessed quantity used to compute the sample variance σ²;
B3. All key-value pairs belonging to the same group are transferred to the same reducer.
4. The MapReduce model-based two-stage online sampling method of claim 1, wherein: in step C, when the map end produces a new output result, the value field of the key-value pair in that output carries the number of valid tuples contained in the corresponding data block, as counted by the combine function; the acceptance-rejection sampler reads this number of valid tuples from the key-value pair, computes the acceptance probability of the corresponding data block, and then generates the random number u.
5. The MapReduce model-based two-stage online sampling method of any one of claims 1 to 4, wherein: in step C, the acceptance probability of a data block containing no valid tuples is set to
α_i = (b_min / b_max)^β
where b_min denotes the minimum number of valid tuples contained in any data block.
CN201911267526.5A 2019-12-11 2019-12-11 Two-stage online sampling method based on MapReduce model Active CN110704515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911267526.5A CN110704515B (en) 2019-12-11 2019-12-11 Two-stage online sampling method based on MapReduce model


Publications (2)

Publication Number Publication Date
CN110704515A CN110704515A (en) 2020-01-17
CN110704515B true CN110704515B (en) 2020-06-02

Family

ID=69208090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911267526.5A Active CN110704515B (en) 2019-12-11 2019-12-11 Two-stage online sampling method based on MapReduce model

Country Status (1)

Country Link
CN (1) CN110704515B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205359B (en) * 2021-04-27 2024-04-05 Kingdee Software (China) Co., Ltd. Method and device for determining commodity price in bill and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799486A (en) * 2012-06-18 2012-11-28 北京大学 Data sampling and partitioning method for MapReduce system
CN104679590A (en) * 2013-11-27 2015-06-03 阿里巴巴集团控股有限公司 Map optimization method and device in distributive calculating system
CN106202172A (en) * 2016-06-24 2016-12-07 中国农业银行股份有限公司 Text compression methods and device
CN107066328A (en) * 2017-05-19 2017-08-18 成都四象联创科技有限公司 The construction method of large-scale data processing platform

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216683A (en) * 2013-05-31 2014-12-17 国际商业机器公司 Method and system for data processing through simultaneous multithreading (SMT)
CN103699696B (en) * 2014-01-13 2017-01-18 中国人民大学 Data online gathering method in cloud computing environment
CN104503844B (en) * 2014-12-29 2018-03-09 中国科学院深圳先进技术研究院 A kind of MapReduce operation fine grit classification methods based on multistage feature
CN106874367A (en) * 2016-12-30 2017-06-20 江苏号百信息服务有限公司 A kind of sampling distribution formula clustering method based on public sentiment platform


Also Published As

Publication number Publication date
CN110704515A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
US9152691B2 (en) System and method for performing set operations with defined sketch accuracy distribution
WO2016101628A1 (en) Data processing method and device in data modeling
Yun et al. Fastraq: A fast approach to range-aggregate queries in big data environments
CN106708989A (en) Spatial time sequence data stream application-based Skyline query method
EP3717997A1 (en) Cardinality estimation in databases
CN110704515B (en) Two-stage online sampling method based on MapReduce model
CN108733688B (en) Data analysis method and device
Kulessa et al. Model-based approximate query processing
Tran et al. Conditioning and aggregating uncertain data streams: Going beyond expectations
WO2022095661A1 (en) Update method and apparatus for recommendation model, computer device, and storage medium
CN107656995A (en) Towards the data management system of big data
WO2019184325A1 (en) Community division quality evaluation method and system based on average mutual information
CN109062949A (en) A kind of method of multi-table join search efficiency in raising Online aggregate
CN110597857B (en) Online aggregation method based on shared sample
CN111629216A (en) VOD service cache replacement method based on random forest algorithm under edge network environment
CN113641654B (en) Marketing treatment rule engine method based on real-time event
CN108256028B (en) Multi-dimensional dynamic sampling method for approximate query in cloud computing environment
Sheoran et al. A Step Toward Deep Online Aggregation (Extended Version)
WO2019153543A1 (en) Data dimension generation method, apparatus, device, and computer readable storage medium
CN103560921A (en) Method for merging network streaming data
CN112650770B (en) MySQL parameter recommendation method based on query work load analysis
Yang et al. A review of uncertain data stream clustering algorithms
CN110909019B (en) Big data duplicate checking method and device, computer equipment and storage medium
CN115827930B (en) Data query optimization method, system and device for graph database
WO2024041221A1 (en) Selection rate estimation method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant