CN106997303A - Big data approximate evaluation method based on MapReduce - Google Patents

Big data approximate evaluation method based on MapReduce

Info

Publication number
CN106997303A
CN106997303A (application CN201710230053.6A)
Authority
CN
China
Prior art keywords
counter
omega
value
virtual machine
designated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710230053.6A
Other languages
Chinese (zh)
Other versions
CN106997303B (en)
Inventor
蔡志平 (Cai Zhiping)
孙文成 (Sun Wencheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201710230053.6A priority Critical patent/CN106997303B/en
Publication of CN106997303A publication Critical patent/CN106997303A/en
Application granted granted Critical
Publication of CN106997303B publication Critical patent/CN106997303B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45583Memory management, e.g. access or allocation

Abstract

The present invention relates to a MapReduce-based big data approximate processing method. In the proposed approximate-counting MapReduce programming model, all items share counters, which reduces memory space and communication cost. At a small performance cost, the invention obtains approximate processing results with a bounded deviation, thereby improving MapReduce processing speed. On the basis of the approximate counts, the invention calculates the confidence level and standard deviation of the estimate, and accuracy improves as available memory and bandwidth increase. In addition, given a target accuracy, the minimum required memory space can be determined.

Description

Big data approximate evaluation method based on MapReduce
Technical field
The present invention relates to approximate processing of big data. Based on the MapReduce programming model, it uses a small performance cost to obtain approximate processing results with a bounded deviation, thereby improving MapReduce processing speed.
Background technology
In the big data era, data exhibit unprecedented volume, astonishing generation speed, and diverse structure, making large-scale data analysis increasingly challenging. To meet scalability requirements, big data is currently processed in parallel mainly on server clusters.
Google formally proposed the MapReduce programming model through three papers published from 2003 onward (GFS, MapReduce, BigTable). In server clusters oriented to big data processing, MapReduce has become the most popular framework owing to its simplicity of programming: even developers without any experience in parallel and distributed systems can easily use distributed programming to solve cumbersome tasks. In addition, the open-source Apache Hadoop has contributed greatly to the wide adoption of MapReduce in industry and academia; Google alone has more than 7000 applications built on MapReduce.
MapReduce is widely deployed to mine frequent itemsets and the association rules among them in large data sets. However, when the data set is large, the memory space and communication cost required for MapReduce to work normally are enormous.
The present invention takes mining an electronic-transaction data set as an example. Suppose the data set contains 10 billion transaction relationships, with 5000 servers running Map tasks and 200 servers running Reduce tasks; each server must then handle up to 10 billion transaction relationships. Storing each transaction relationship takes 60 bits and each counter 20 bits; since each server maintains counters on a scale of more than 10 billion, each server needs at least 100GB of memory, of which 75GB is used for storing transaction relationships and 25GB for counters. Accordingly, the Map stage consumes 500TB and the Reduce stage 20TB, 520TB in total. Such memory demand imposes an enormous operating cost on the server cluster.
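The figures above follow from a short calculation (a sketch; the variable names are ours, the quantities are exactly those of the example):

```python
# Memory arithmetic for the classical exact-counting example:
# 10 billion transaction relationships, 60 bits per stored relationship,
# 20 bits per counter, 5000 Map servers and 200 Reduce servers.
RELATIONS = 10_000_000_000
KEY_BITS, COUNTER_BITS = 60, 20
MAP_SERVERS, REDUCE_SERVERS = 5000, 200

per_server_bytes = RELATIONS * (KEY_BITS + COUNTER_BITS) / 8

print(per_server_bytes / 1e9)                    # 100.0 GB per server
print(RELATIONS * KEY_BITS / 8 / 1e9)            # 75.0 GB for relationships
print(RELATIONS * COUNTER_BITS / 8 / 1e9)        # 25.0 GB for counters
print(MAP_SERVERS * per_server_bytes / 1e12)     # 500.0 TB in the Map stage
print(REDUCE_SERVERS * per_server_bytes / 1e12)  # 20.0 TB in the Reduce stage
```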
Such a large memory demand is hard to satisfy, since the operating system and the data-processing software also need storage space. One could of course keep most transaction relationships and their counters in virtual memory to reduce physical memory use, but the random access pattern of virtual memory drastically reduces processing speed, and constantly fetching data from it increases the I/O load; this is therefore not an effective solution.
For many practical problems, an exact result is not particularly important. Suppose the average count of a transaction is 100; when the count of one transaction rises markedly into the thousands, the retailer knows that this transaction relationship is important and can target it for recommendation, regardless of whether the number is 5000 or 5500. Likewise, if the number is very small, the item will not be recommended, whether the count is 1 or 50.
On this basis, the present invention proposes an approximate-counting MapReduce model: at a small performance cost, it obtains approximate processing results with a bounded deviation, thereby improving MapReduce processing speed.
The content of the invention
On the basis of MapReduce, the present invention applies approximation: it uses a small performance cost to obtain results with a bounded deviation, thereby improving MapReduce processing speed. The invention mainly targets large-scale counting.
For large-scale counting, the classical MapReduce approach assigns one counter to each item, keyed by that item. The benefit is an exact result, but the memory space and communication costs are too high. In the approximate-counting MapReduce model (ACMP) proposed by the invention, each item corresponds to l counters, and each counter may count multiple items. Since the number of items is far larger than the number of counters, the counters' occupancy of memory space and the communication cost are both reduced.
In the Map stage there are s virtual machines in total, and in the Reduce stage t virtual machines. By default, the invention requires that s/t be an integer and that s be much larger than t. In the Map stage the virtual machines are divided into t groups of s/t machines each, each group corresponding to one Reduce-stage virtual machine; when the Map stage ends, the Map-stage virtual machines send all their values to the corresponding Reduce virtual machines.
The Map and Reduce stages together comprise s+t virtual machines, all identically configured: each virtual machine has m counters, divided into l groups, so each group has m/l counters. By default, the invention requires that m/l be an integer. The i-th counter group of a virtual machine is denoted M_i (0 ≤ i < l), and the j-th counter of group i is denoted M_i[j].
For any item k, on each occurrence of k the VM generates a random number r (0 ≤ r < l) and evaluates the hash function h = H(k, r), with 0 ≤ h < m/l, then increments M_r[h] by 1. Because r is random, each occurrence of k may be recorded by any one of the l counters M_0[H(k,0)], M_1[H(k,1)], …, M_{l−1}[H(k,l−1)]; these are called k's representative counters. Note that within any virtual machine the indices of k's representative counters are fixed: occurrences of k are counted only by its representative counters, and since those indices are identical in every virtual machine, the count values of different virtual machines can be added index-wise.
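The per-occurrence update rule can be sketched as follows (a minimal illustration; the class name, the MD5-based construction of H, and the RNG seeding are our assumptions, not specified by the patent):

```python
import hashlib
import random

class ACMPMapper:
    """m counters per VM, split into l groups of m/l counters each (sketch)."""

    def __init__(self, m, l, seed=0):
        assert m % l == 0, "the model requires m/l to be an integer"
        self.l, self.group_size = l, m // l
        # M[i][j] is the j-th counter of group i, written M_i[j] in the text
        self.M = [[0] * self.group_size for _ in range(l)]
        self.rng = random.Random(seed)

    def H(self, k, i):
        # Deterministic hash of (k, i) into a slot of group i: the representative
        # counter indices of an item k are therefore fixed across virtual machines.
        d = hashlib.md5(f"{k}|{i}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.group_size

    def record(self, k):
        # Each occurrence of k is recorded by ONE randomly chosen
        # representative counter M_r[H(k, r)], 0 <= r < l.
        r = self.rng.randrange(self.l)
        self.M[r][self.H(k, r)] += 1
```

Because H depends only on (k, i), summing the counter arrays of different virtual machines index by index preserves every item's representative counters, which is exactly what the Reduce stage relies on.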
After the Map stage ends, every Map virtual machine sends its recorded values to the corresponding Reduce virtual machine, which carries out the next processing step. Each virtual machine holds m counters, hence m count values; these m values together form one count group. As above, each Reduce virtual machine corresponds to s/t Map virtual machines, so each Reduce virtual machine receives s/t count groups. The Reduce virtual machine then sums count values with the same index, storing each sum in its own counter with that index; thus every Reduce virtual machine also produces m count values, i.e., one count group. The t count groups of the Reduce stage are denoted Π. Finally, the t count groups Π are sent to the master node, which computes the count value of each item.
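The index-wise addition performed by each Reduce virtual machine is just an element-wise sum of the s/t received count groups; a minimal sketch (function name is ours):

```python
def reduce_merge(count_groups):
    """Sum the count groups received from one group of Map VMs, index by index.

    Each count group is a flat list of m values; the result is again one
    count group of m values, as described in the text.
    """
    m = len(count_groups[0])
    assert all(len(g) == m for g in count_groups), "all VMs are configured alike"
    merged = [0] * m
    for g in count_groups:
        for j, v in enumerate(g):
            merged[j] += v
    return merged
```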
Take any item k; the following shows how the master node computes the count value of k.
Take any count group in Π and label it M. M has m count values, divided into l groups of m/l values each. Moreover, since the indices of k's representative counters are fixed, M also contains k's l representative counters, denoted M_i[H(k,i)] (0 ≤ i < l). The count value for k obtained by the invention is not the true value of k but a reasonable estimate. Let Ω(M,K) denote the true count of k recorded in M; in addition, each representative counter M_i[H(k,i)] yields one estimate, denoted Ω_i'.
M_i[H(k,i)] records occurrences of k together with occurrences of other items. Within M_i[H(k,i)], let the random variable Y denote the number of occurrences of k and the random variable Z the number of occurrences of other items. Each occurrence of k is recorded by counter M_i[H(k,i)] with probability 1/l, since there are l representative counters; therefore Y follows the binomial distribution B(Ω(M,K), 1/l). When Ω(M,K) is sufficiently large, Y can be approximated by a normal distribution, i.e.:

$$Y \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)\right) \qquad (1)$$
Each occurrence of any other item k' is recorded by M_i[H(k,i)] with probability 1/m, since all m counters have the same chance of recording k'. Let n(M) denote the total number of occurrences of all other keys; then Z follows the binomial distribution B(n(M), 1/m). Because Ω(M,K) is much smaller than n(M), Z likewise admits a Gaussian approximation, i.e.:

$$Z \sim \mathrm{Norm}\!\left(\frac{n(M)}{m},\ \frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (2)$$
Take the random variable X = Y + Z; the observed value M_i[H(k,i)] is one sample of X. Since the sum of two independent Gaussian variables is again Gaussian, X is also Gaussian, which yields (3):

$$X \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l}+\frac{n(M)}{m},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)+\frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (3)$$
The estimate of Ω(M,K) is obtained by this method; its accuracy is analyzed later using these variables. The theoretical value of Ω(M,K) is as follows:

$$E(X)=\frac{\Omega(M,K)}{l}+\frac{n(M)}{m}, \qquad \Omega(M,K)=\left(E(X)-\frac{n(M)}{m}\right)l \qquad (4)$$
E(X) is a theoretical value and can be replaced by the observed value of M_i[H(k,i)], which yields the estimate of Ω(M,K), denoted Ω_i', i.e.:

$$\Omega_i'=\left(M_i[H(k,i)]-\frac{n(M)}{m}\right)l, \qquad 1 \le i \le l \qquad (5)$$
Taking the average of the above l theoretical values gives Ω(M,K); likewise, the average of the l estimates, denoted Ω(M,K)', is:

$$\Omega(M,K)'=\frac{1}{l}\sum_{i=0}^{l-1}\Omega_i' \qquad (6)$$
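Equations (5) and (6) translate directly into code (a sketch; the argument names and the shape of the hash H are our assumptions):

```python
def estimate_omega(M, k, n_M, l, H):
    """Estimate Ω(M,K) from one count group, per equations (5)-(6).

    M   : list of l groups of counters (one count group)
    n_M : total occurrences of all other keys in M, n(M) in the text
    l   : number of counter groups
    H   : the deterministic hash H(k, i) used by the Map stage
    """
    m = sum(len(g) for g in M)  # total number of counters in the group
    # one estimate per representative counter, eq. (5)
    estimates = [(M[i][H(k, i)] - n_M / m) * l for i in range(l)]
    # their average is the group estimate Ω(M,K)', eq. (6)
    return sum(estimates) / l
```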
Recall that k's occurrences are recorded by t count groups. Denoting k's total occurrence count by Ω(K), we obtain:

$$\Omega(K)=\sum_{M\in\Pi}\Omega(M,k) \qquad (7)$$

The estimate of k's total occurrence count is denoted Ω(K)', i.e.:

$$\Omega(K)'=\sum_{M\in\Pi}\Omega(M,k)' \qquad (8)$$
Because the M_i[H(k,i)] (0 ≤ i < l) are independent samples of X as distributed in (3), it follows from (5) that the Ω_i' (1 ≤ i ≤ l) are independent random variables with the Gaussian distribution:

$$\Omega_i' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)(l-1)+\frac{n(M)\,l^2}{m}\left(1-\frac{1}{m}\right)\right)$$
From (6), the distribution of Ω(M,K)' is:

$$\Omega(M,K)' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
Therefore, the estimate Ω(K)' of k's total occurrence count satisfies the distribution:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\sum_{M\in\Pi}\Omega(M,K),\ \sum_{M\in\Pi}\left(\Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)\right)$$
Applying (7), we have:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\Omega(K),\ \Omega(K)\left(1-\frac{1}{l}\right)+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
Finally, the mean and variance of the estimate are obtained as follows:

$$E(\Omega(K)')=\Omega(K), \qquad \mathrm{Var}(\Omega(K)')=\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right) \qquad (9)$$
Therefore, for confidence level 1−α there exists z_{1−α/2} such that the confidence interval is:

$$\Omega(K)\pm z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)} \qquad (10)$$
Consider a random key k whose true count recorded by all counters in Π is Ω(M,K). Given confidence level 1−α, the count estimate Ω(M,K)' should fall within the range prescribed by 1−α; the accuracy requirement is therefore:

$$\mathrm{Prob}\{\Omega(M,K)'\in[\Omega(M,K)-v,\ \Omega(M,K)+v]\}>1-\alpha \qquad (11)$$
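The interval in (10) can be evaluated numerically; a sketch, assuming the argument names below (`statistics.NormalDist` supplies the quantile z_{1−α/2}):

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(omega_k, n_per_group, m, l, alpha):
    """Confidence interval for the total count Ω(K), per equations (9)-(10).

    omega_k     : Ω(K), the (true or estimated) total count of item k
    n_per_group : list of n(M) for every count group M in Π
    m, l        : counters per VM and number of counter groups
    alpha       : 1 - alpha is the confidence level
    """
    var = omega_k * (l - 1) / l + sum(n_per_group) * l / m * (1 - 1 / m)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    half_width = z * sqrt(var)
    return omega_k - half_width, omega_k + half_width
```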
The goal of the invention is to minimize the system's memory requirement while meeting the target accuracy, i.e., to minimize the number of counters, denoted m, the number of counters per virtual machine.
From equation (10), it follows that:

$$z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)}<v \qquad (12)$$
Solving (12) yields the minimum number of counters, which satisfies relation (13):

$$m>\frac{1+\sqrt{1-4w/l}}{2w/l}, \qquad w=\frac{\left(v/z_{1-\alpha/2}\right)^2-c\,\frac{l-1}{l}}{\sum_{M\in\Pi}n(M)} \qquad (13)$$

where l is the number of counter groups per virtual machine, α the confidence parameter, v the standard error of the count result, c the count threshold, Σ_{M∈Π} n(M) the size of task W, and m the number of counters per virtual machine.
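Relation (13) gives the minimum counter count directly; a sketch under the assumption that the accuracy target is attainable (w > 0 and 4w/l ≤ 1), with argument names ours:

```python
from math import sqrt
from statistics import NormalDist

def min_counters(v, alpha, c, l, task_size):
    """Minimum number of counters m per VM satisfying relation (13).

    v         : target standard error of the count result
    alpha     : 1 - alpha is the confidence level
    c         : count threshold
    l         : number of counter groups per VM
    task_size : sum of n(M) over all count groups, the size of task W
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    w = ((v / z) ** 2 - c * (l - 1) / l) / task_size
    assert w > 0 and 4 * w / l <= 1, "accuracy target not attainable with this l"
    return (1 + sqrt(1 - 4 * w / l)) / (2 * w / l)
```

With settings like those of Fig. 3 (α = 0.05, c = 10000, v = 500, l = 1024, task size 10^10), this yields on the order of 10^8 counters per VM.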
The beneficial effect of the invention is that, at a small performance cost, it obtains approximate processing results whose deviation is both bounded and controllable, preserving practicality while improving MapReduce processing speed. These are the advantages and novelties of the invention.
Brief description of the drawings
Fig. 1 is the structure of the approximate-counting MapReduce model used in the invention;
Fig. 2 is the counter layout of the virtual machines used in the invention;
Fig. 3 shows the relation between memory requirement (GB) and standard error v for classical MP and ACMP;
Fig. 4 shows the relation between memory requirement (GB) and count threshold c for classical MP and ACMP;
Fig. 5 shows the relation between memory requirement (GB) and confidence level for classical MP and ACMP;
Fig. 6 shows the relation between memory requirement (GB) and total count TM for classical MP and ACMP;
Fig. 7 shows the relation between memory requirement (GB) and the parameter l for classical MP and ACMP.
Embodiment
The invention is described in further detail below with reference to the drawings:
As shown in Fig. 1, the approximate-counting MapReduce model comprises two parts, the Map stage and the Reduce stage. In the Map stage, task W is divided into s parts and handed to the Map virtual machines. After a Map virtual machine has processed its assigned subtask, it forms an intermediate file and sends it to a Reduce virtual machine. The Map virtual machines are divided into t groups of s/t machines each. The results of the virtual machines in each group are gathered and added according to the same index, yielding an intermediate result; the number of count values in the intermediate result equals that of any single Map virtual machine. There are t Reduce virtual machines, one per group of Map virtual machines. After receiving the t groups of intermediate results, the Reduce virtual machines send them to the master node. For a specific item count, the master node operates on the t received count groups and derives the count estimate of the item.
Fig. 2 shows the counter layout of a virtual machine used in the invention. As shown, each virtual machine has m counters, divided into l groups, so each group has m/l counters.
Figs. 3 to 7 present a series of experiments verifying the memory requirement of approximate-counting MapReduce. According to formula (13), ACMP's memory consumption depends on five parameters: α (confidence parameter), c (count threshold), v (standard error), TM (item-set size), and l (number of counter groups). Assume the total number of occurrences of all keys is 10 billion. Then, storing purchase relations at 60 bits each and counters at 20 bits each, the memory requirement reaches 10^10 × (60+20)/8 B = 100GB, of which the purchase relations take 75GB and the counters 25GB. The present scheme reduces only the memory occupied by counters; the purchase relations can be kept in a buffer, and only the counters reside uniformly in memory.
Fig. 3 compares the memory requirements of the two schemes as a function of the standard error v, with the confidence level set to 95% and the count threshold to 10000. As the standard error decreases, ACMP's memory demand increases: higher counting accuracy costs more memory. With the standard error at 2000 the memory requirement is 75.02GB; at 250 it rises to 79.08GB, a 4× increase in counter memory. Compared with classical MP, as v rises from 250 to 2000, ACMP reduces the total memory demand by 21.9% to 25.0% and the counters' memory demand by 87.7% to 99.2%. Therefore, when the accuracy requirement is not stringent, ACMP greatly reduces memory use compared with classical MP.
Fig. 4 compares the memory requirements of the two schemes as a function of the support threshold c, with the confidence level set to 95% and standard errors of 500 and 1000. As the threshold increases, the curve for standard error 1000 grows more slowly than that for 500, and the gap between them widens exponentially. The reason is that when the count threshold is large the standard deviation is also large, so increasing the count threshold barely affects the result, and the memory requirement grows little. For the same support threshold, the smaller the standard error, the larger the required memory: for example, at count threshold 10,000 the counter memory needed by ACMP at standard error 500 is 4.6 times that at 1000, because higher accuracy requires more counters and hence consumes more memory.
Fig. 5 compares the two schemes by confidence level, set in turn to 75%, 85%, 90%, 92.5%, 95%, 97.5% and 99%, with the count threshold at 10,000 and the standard error at 500. As the confidence level rises, ACMP's memory demand grows exponentially. Therefore, when the required counting accuracy is modest, raising the accuracy of the estimate by adding memory is feasible.
Fig. 6 compares the memory requirements of the two schemes by item-set size, varied from 10 billion down to 0.5 billion, with the count threshold at 10000 and the standard error at 500. As the item set shrinks, the memory demand of both schemes decreases linearly. The item-set size, however, is determined by the database and needs no tuning. Note that in approximate-counting MapReduce the average noise grows with the item-set size, so noise can be reduced by adding memory space.
Fig. 7 compares the memory requirements of the two schemes by the parameter l, set in turn to 1024, 2048, 4096, 8192, 12288 and 16384, with the count threshold at 10,000 and the standard error at 500. The memory space required by ACMP grows linearly with l; the parameter l should therefore not be set too large.

Claims (2)

1. A MapReduce-based big data approximate processing method, proposing an approximate-counting MapReduce model in which each item corresponds to l counters and each counter counts multiple items; since the number of items is far larger than the number of counters, the counters' occupancy of memory space and the communication cost are reduced; characterized in that, when counting electronic transaction data, items and counters are not in one-to-one correspondence, the number of counters being far smaller than the number of items; in the processing of intermediate files, the Map virtual machines are grouped to reduce communication cost; in terms of result accuracy, the count results are not exact, but their confidence interval and standard error are controllable; and in memory usage, the use of memory space is greatly reduced by reducing the number of counters;
in the Map stage there are s virtual machines in total and in the Reduce stage t virtual machines, where by default s/t must be an integer and s is much larger than t;
in the Map stage the virtual machines are divided into t groups of s/t machines each, each group corresponding to one Reduce-stage virtual machine; when the Map stage ends, the Map-stage virtual machines send all their values to the corresponding Reduce virtual machines;
the Map and Reduce stages together comprise s+t virtual machines, all identically configured: each virtual machine has m counters, divided into l groups, so each group has m/l counters, where by default m/l must be an integer; the i-th counter group of a virtual machine is denoted M_i, 0 ≤ i < l, and the j-th counter of group i is denoted M_i[j];
for any item k, on each occurrence of k the VM generates a random number r, 0 ≤ r < l, and evaluates the hash function h = H(k, r), 0 ≤ h < m/l, then increments M_r[h] by 1; because r is random, each occurrence of k may be recorded by any one of the l counters M_0[H(k,0)], M_1[H(k,1)], …, M_{l−1}[H(k,l−1)], called k's representative counters; occurrences of k are counted only by k's representative counters, and in every virtual machine the indices of k's representative counters are identical, so the count values of different virtual machines are added index-wise.
2. The MapReduce-based big data approximate processing method according to claim 1, characterized in that the count value of k is calculated as follows:
take any count group in Π and label it M; M has m count values, divided into l groups of m/l values each; since the indices of k's representative counters are fixed, M also contains k's l representative counters, denoted M_i[H(k,i)] (0 ≤ i < l); the count value obtained for k is not the true value of k but a reasonable estimate; let Ω(M,K) denote the true count of k recorded in M, and let each representative counter M_i[H(k,i)] yield one estimate, denoted Ω_i';
M_i[H(k,i)] records occurrences of k together with occurrences of other items; within M_i[H(k,i)], the random variable Y denotes the number of occurrences of k and the random variable Z the number of occurrences of other items; each occurrence of k is recorded by counter M_i[H(k,i)] with probability 1/l, since there are l representative counters, so Y follows the binomial distribution B(Ω(M,K), 1/l); when Ω(M,K) is sufficiently large, Y can be approximated by a normal distribution, i.e.:

$$Y \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)\right) \qquad (1)$$
each occurrence of any other item k' is recorded by M_i[H(k,i)] with probability 1/m, since all m counters have the same chance of recording k'; let n(M) denote the total number of occurrences of all other keys; then Z follows the binomial distribution B(n(M), 1/m); because Ω(M,K) is much smaller than n(M), Z likewise admits a Gaussian approximation, i.e.:

$$Z \sim \mathrm{Norm}\!\left(\frac{n(M)}{m},\ \frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (2)$$
take the random variable X = Y + Z; the observed value M_i[H(k,i)] is one sample of X; since the sum of two independent Gaussian variables is again Gaussian, X is also Gaussian, which yields (3):

$$X \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l}+\frac{n(M)}{m},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)+\frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (3)$$
the estimate of Ω(M,K) is obtained by this method, and its accuracy is analyzed later; the theoretical value of Ω(M,K) is as follows:

$$E(X)=\frac{\Omega(M,K)}{l}+\frac{n(M)}{m}, \qquad \Omega(M,K)=\left(E(X)-\frac{n(M)}{m}\right)l \qquad (4)$$
E(X) is a theoretical value and can be replaced by the observed value of M_i[H(k,i)], which yields the estimate of Ω(M,K), denoted Ω_i', i.e.:

$$\Omega_i'=\left(M_i[H(k,i)]-\frac{n(M)}{m}\right)l, \qquad 1 \le i \le l \qquad (5)$$
taking the average of the above l theoretical values gives Ω(M,K); likewise, the average of the l estimates, denoted Ω(M,K)', is:

$$\Omega(M,K)'=\frac{1}{l}\sum_{i=0}^{l-1}\Omega_i' \qquad (6)$$
k's occurrences are recorded by t count groups; denoting k's total occurrence count by Ω(K), we obtain:

$$\Omega(K)=\sum_{M\in\Pi}\Omega(M,k) \qquad (7)$$

the estimate of k's total occurrence count is denoted Ω(K)', i.e.:

$$\Omega(K)'=\sum_{M\in\Pi}\Omega(M,k)' \qquad (8)$$
because the M_i[H(k,i)] (0 ≤ i < l) are independent samples of X as distributed in formula (3), it follows from (5) that the Ω_i' (1 ≤ i ≤ l) are independent random variables with the Gaussian distribution:

$$\Omega_i' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)(l-1)+\frac{n(M)\,l^2}{m}\left(1-\frac{1}{m}\right)\right)$$
from (6), the distribution of Ω(M,K)' is:

$$\Omega(M,K)' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
therefore, the estimate Ω(K)' of k's total occurrence count satisfies the distribution:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\sum_{M\in\Pi}\Omega(M,K),\ \sum_{M\in\Pi}\left(\Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)\right)$$
applying the above to formula (7) gives:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\Omega(K),\ \Omega(K)\left(1-\frac{1}{l}\right)+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
finally, the mean and variance of the estimate are obtained as follows:

$$E(\Omega(K)')=\Omega(K), \qquad \mathrm{Var}(\Omega(K)')=\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right) \qquad (9)$$
therefore, for confidence level 1−α there exists z_{1−α/2} such that the confidence interval is:

$$\Omega(K)\pm z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)} \qquad (10)$$
consider a random key k whose true count recorded by all counters in Π is Ω(M,K); given confidence level 1−α, the count estimate Ω(M,K)' should fall within the range prescribed by 1−α; the accuracy requirement is therefore:

$$\mathrm{Prob}\{\Omega(M,K)'\in[\Omega(M,K)-v,\ \Omega(M,K)+v]\}>1-\alpha \qquad (11)$$
the goal of the invention is to minimize the system's memory requirement while meeting the target accuracy, i.e., to minimize the number of counters, denoted m, the number of counters per virtual machine; from equation (10), it follows that:

$$z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)}<v \qquad (12)$$
solving formula (12) yields the minimum number of counters, which satisfies relation (13):

$$m>\frac{1+\sqrt{1-4w/l}}{2w/l}, \qquad w=\frac{\left(v/z_{1-\alpha/2}\right)^2-c\,\frac{l-1}{l}}{\sum_{M\in\Pi}n(M)} \qquad (13)$$

where l is the number of counter groups per virtual machine, α the confidence parameter, v the standard error of the count result, c the count threshold, Σ_{M∈Π} n(M) the size of task W, and m the number of counters per virtual machine.
CN201710230053.6A 2017-04-10 2017-04-10 MapReduce-based big data approximate processing method Active CN106997303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710230053.6A CN106997303B (en) 2017-04-10 2017-04-10 MapReduce-based big data approximate processing method

Publications (2)

Publication Number Publication Date
CN106997303A true CN106997303A (en) 2017-08-01
CN106997303B CN106997303B (en) 2020-07-17

Family

ID=59435113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710230053.6A Active CN106997303B (en) 2017-04-10 2017-04-10 MapReduce-based big data approximate processing method

Country Status (1)

Country Link
CN (1) CN106997303B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436954A (en) * 2017-08-16 2017-12-05 吉林大学 A kind of online flow data approximate processing method of quality control and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609347A (en) * 2012-02-17 2012-07-25 江苏南开之星软件技术有限公司 Method for detecting load hotspots in virtual environment
US20120311581A1 (en) * 2011-05-31 2012-12-06 International Business Machines Corporation Adaptive parallel data processing
US20130185729A1 (en) * 2012-01-13 2013-07-18 Rutgers, The State University Of New Jersey Accelerating resource allocation in virtualized environments using workload classes and/or workload signatures
CN103544258A (en) * 2013-10-16 2014-01-29 国家计算机网络与信息安全管理中心 Cardinal number estimating method and cardinal number estimating device under multi-section query condition of big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI YAN et al.: "Scalable Load Balancing for MapReduce-based Record Linkage", 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC) *
TONG Weiqin et al.: "Data-Intensive Computing and Models" (数据密集型计算和模型), 31 January 2015 *


Also Published As

Publication number Publication date
CN106997303B (en) 2020-07-17

Similar Documents

Publication Publication Date Title
US11853283B2 (en) Dynamic aggregate generation and updating for high performance querying of large datasets
Kalodner et al. {BlockSci}: Design and applications of a blockchain analysis platform
CN102915347B (en) A kind of distributed traffic clustering method and system
Lee et al. Large-scale incremental processing with MapReduce
CN107766402A (en) A kind of building dictionary cloud source of houses big data platform
US20160217193A1 (en) Database and method for evaluating data therefrom
CN107229995A (en) Realize method, device and computer-readable recording medium that game service amount is estimated
US10929361B2 (en) Rule-based data source selection
Amossen Vertical partitioning of relational OLTP databases using integer programming
CN109408590A (en) Expansion method, device, equipment and the storage medium of distributed data base
CN105117442A (en) Probability based big data query method
CN108205713A (en) A kind of region wind power prediction error distribution determination method and device
CN108416054A (en) Dynamic HDFS copy number calculating methods based on file access temperature
CN105677645A (en) Data sheet comparison method and device
CN106034144A (en) Load-balancing-based virtual asset data storage method
CN106997303A (en) Big data approximate evaluation method based on MapReduce
Wang et al. Gradient scheduling with global momentum for asynchronous federated learning in edge environment
CN110580307B (en) Processing method and device for fast statistics
CN104778088A (en) Method and system for optimizing parallel I/O (input/output) by reducing inter-progress communication expense
CN105631047A (en) Hierarchically-cascaded data processing method and hierarchically-cascaded data processing system
Yang et al. Probabilistic modeling of renewable energy source based on Spark platform with large‐scale sample data
Zhang et al. An optimization algorithm applied to the class integration and test order problem
CN110489460B (en) Optimization method and system for rapid statistics
CN108520053B (en) Big data query method based on data distribution
Venkateswaran et al. Using machine learning for intelligent shard sizing on the cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant