CN106997303A - Big data approximate evaluation method based on MapReduce - Google Patents

Big data approximate evaluation method based on MapReduce

Info

Publication number
CN106997303A
CN106997303A (application CN201710230053.6A)
Authority
CN
China
Prior art keywords
counter
omega
value
virtual machine
designated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710230053.6A
Other languages
Chinese (zh)
Other versions
CN106997303B (en)
Inventor
蔡志平 (Cai Zhiping)
孙文成 (Sun Wencheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201710230053.6A priority Critical patent/CN106997303B/en
Publication of CN106997303A publication Critical patent/CN106997303A/en
Application granted granted Critical
Publication of CN106997303B publication Critical patent/CN106997303B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45583Memory management, e.g. access or allocation

Abstract

The present invention relates to a MapReduce-based big data approximate processing method. In the proposed approximate-counting MapReduce programming model, all items share counters, which reduces memory space and communication cost. At a small performance cost, the invention obtains approximate processing results with a bounded deviation, thereby improving MapReduce processing speed. On the basis of the approximate counts, the invention calculates the confidence level and standard deviation of the estimate, and accuracy improves as available memory and bandwidth increase. In addition, given a target accuracy, the minimum required memory space can be determined.

Description

Big data approximate evaluation method based on MapReduce
Technical field
The present invention relates to approximate processing of big data. Based on the MapReduce programming model, it uses a small performance cost to obtain approximate processing results with a bounded deviation, thereby improving MapReduce processing speed.
Background technology
In the big data era, data exhibit unprecedented volume, astonishing generation speed, and diverse structure, making large-scale data analysis increasingly challenging. To meet scalability requirements, big data is currently processed in parallel mainly on server clusters.
Google formally proposed the MapReduce programming model through three papers published from 2003 onward (GFS, MapReduce, BigTable). In server clusters oriented to big data processing, MapReduce has become the most popular framework owing to its simplicity of programming: even developers without any experience in parallel and distributed systems can easily use distributed programming to solve cumbersome tasks. In addition, the open-source Apache Hadoop has contributed greatly to the wide adoption of MapReduce in industry and academia; Google alone has more than 7000 applications built on MapReduce.
MapReduce is widely deployed to mine frequent itemsets and the association rules among them in large data sets. However, when the data set is large, the memory space and communication cost required for MapReduce to work normally are enormous.
The present invention takes mining an electronic-transaction data set as an example. Suppose the data set contains 10 billion transaction relationships, with 5000 servers running Map tasks and 200 servers running Reduce tasks; each server must then handle up to 10 billion transaction relationships. Storing each transaction relationship takes 60 bits and each counter 20 bits; since each server maintains counters on a scale of more than 10 billion, each server needs at least 100GB of memory, of which 75GB is used for storing transaction relationships and 25GB for counters. Accordingly, the Map stage consumes 500TB and the Reduce stage 20TB, 520TB in total. Such memory demand imposes an enormous operating cost on the server cluster.
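The figures above follow from a short calculation (a sketch; the variable names are ours, the quantities are exactly those of the example):

```python
# Memory arithmetic for the classical exact-counting example:
# 10 billion transaction relationships, 60 bits per stored relationship,
# 20 bits per counter, 5000 Map servers and 200 Reduce servers.
RELATIONS = 10_000_000_000
KEY_BITS, COUNTER_BITS = 60, 20
MAP_SERVERS, REDUCE_SERVERS = 5000, 200

per_server_bytes = RELATIONS * (KEY_BITS + COUNTER_BITS) / 8

print(per_server_bytes / 1e9)                    # 100.0 GB per server
print(RELATIONS * KEY_BITS / 8 / 1e9)            # 75.0 GB for relationships
print(RELATIONS * COUNTER_BITS / 8 / 1e9)        # 25.0 GB for counters
print(MAP_SERVERS * per_server_bytes / 1e12)     # 500.0 TB in the Map stage
print(REDUCE_SERVERS * per_server_bytes / 1e12)  # 20.0 TB in the Reduce stage
```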
Such a large memory demand is hard to satisfy, since the operating system and the data-processing software also need storage space. One could of course keep most transaction relationships and their counters in virtual memory to reduce physical memory use, but the random access pattern of virtual memory drastically reduces processing speed, and constantly fetching data from it increases the I/O load; this is therefore not an effective solution.
For many practical problems, an exact result is not particularly important. Suppose the average count of a transaction is 100; when the count of one transaction rises markedly into the thousands, the retailer knows that this transaction relationship is important and can target it for recommendation, regardless of whether the number is 5000 or 5500. Likewise, if the number is very small, the item will not be recommended, whether the count is 1 or 50.
On this basis, the present invention proposes an approximate-counting MapReduce model: at a small performance cost, it obtains approximate processing results with a bounded deviation, thereby improving MapReduce processing speed.
The content of the invention
On the basis of MapReduce, the present invention applies approximation: it uses a small performance cost to obtain results with a bounded deviation, thereby improving MapReduce processing speed. The invention mainly targets large-scale counting.
For large-scale counting, the classical MapReduce approach assigns one counter to each item, keyed by that item. The benefit is an exact result, but the memory space and communication costs are too high. In the approximate-counting MapReduce model (ACMP) proposed by the invention, each item corresponds to l counters, and each counter may count multiple items. Since the number of items is far larger than the number of counters, the counters' occupancy of memory space and the communication cost are both reduced.
In the Map stage there are s virtual machines in total, and in the Reduce stage t virtual machines. By default, the invention requires that s/t be an integer and that s be much larger than t. In the Map stage the virtual machines are divided into t groups of s/t machines each, each group corresponding to one Reduce-stage virtual machine; when the Map stage ends, the Map-stage virtual machines send all their values to the corresponding Reduce virtual machines.
The Map and Reduce stages together comprise s+t virtual machines, all identically configured: each virtual machine has m counters, divided into l groups, so each group has m/l counters. By default, the invention requires that m/l be an integer. The i-th counter group of a virtual machine is denoted M_i (0 ≤ i < l), and the j-th counter of group i is denoted M_i[j].
For any item k, on each occurrence of k the VM generates a random number r (0 ≤ r < l) and evaluates the hash function h = H(k, r), with 0 ≤ h < m/l, then increments M_r[h] by 1. Because r is random, each occurrence of k may be recorded by any one of the l counters M_0[H(k,0)], M_1[H(k,1)], …, M_{l−1}[H(k,l−1)]; these are called k's representative counters. Note that within any virtual machine the indices of k's representative counters are fixed: occurrences of k are counted only by its representative counters, and since those indices are identical in every virtual machine, the count values of different virtual machines can be added index-wise.
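The per-occurrence update rule can be sketched as follows (a minimal illustration; the class name, the MD5-based construction of H, and the RNG seeding are our assumptions, not specified by the patent):

```python
import hashlib
import random

class ACMPMapper:
    """m counters per VM, split into l groups of m/l counters each (sketch)."""

    def __init__(self, m, l, seed=0):
        assert m % l == 0, "the model requires m/l to be an integer"
        self.l, self.group_size = l, m // l
        # M[i][j] is the j-th counter of group i, written M_i[j] in the text
        self.M = [[0] * self.group_size for _ in range(l)]
        self.rng = random.Random(seed)

    def H(self, k, i):
        # Deterministic hash of (k, i) into a slot of group i: the representative
        # counter indices of an item k are therefore fixed across virtual machines.
        d = hashlib.md5(f"{k}|{i}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.group_size

    def record(self, k):
        # Each occurrence of k is recorded by ONE randomly chosen
        # representative counter M_r[H(k, r)], 0 <= r < l.
        r = self.rng.randrange(self.l)
        self.M[r][self.H(k, r)] += 1
```

Because H depends only on (k, i), summing the counter arrays of different virtual machines index by index preserves every item's representative counters, which is exactly what the Reduce stage relies on.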
After the Map stage ends, every Map virtual machine sends its recorded values to the corresponding Reduce virtual machine, which carries out the next processing step. Each virtual machine holds m counters, hence m count values; these m values together form one count group. As above, each Reduce virtual machine corresponds to s/t Map virtual machines, so each Reduce virtual machine receives s/t count groups. The Reduce virtual machine then sums count values with the same index, storing each sum in its own counter with that index; thus every Reduce virtual machine also produces m count values, i.e., one count group. The t count groups of the Reduce stage are denoted Π. Finally, the t count groups Π are sent to the master node, which computes the count value of each item.
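The index-wise addition performed by each Reduce virtual machine is just an element-wise sum of the s/t received count groups; a minimal sketch (function name is ours):

```python
def reduce_merge(count_groups):
    """Sum the count groups received from one group of Map VMs, index by index.

    Each count group is a flat list of m values; the result is again one
    count group of m values, as described in the text.
    """
    m = len(count_groups[0])
    assert all(len(g) == m for g in count_groups), "all VMs are configured alike"
    merged = [0] * m
    for g in count_groups:
        for j, v in enumerate(g):
            merged[j] += v
    return merged
```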
Take any item k; the following shows how the master node computes the count value of k.
Take any count group in Π and label it M. M has m count values, divided into l groups of m/l values each. Moreover, since the indices of k's representative counters are fixed, M also contains k's l representative counters, denoted M_i[H(k,i)] (0 ≤ i < l). The count value for k obtained by the invention is not the true value of k but a reasonable estimate. Let Ω(M,K) denote the true count of k recorded in M; in addition, each representative counter M_i[H(k,i)] yields one estimate, denoted Ω_i'.
M_i[H(k,i)] records occurrences of k together with occurrences of other items. Within M_i[H(k,i)], let the random variable Y denote the number of occurrences of k and the random variable Z the number of occurrences of other items. Each occurrence of k is recorded by counter M_i[H(k,i)] with probability 1/l, since there are l representative counters; therefore Y follows the binomial distribution B(Ω(M,K), 1/l). When Ω(M,K) is sufficiently large, Y can be approximated by a normal distribution, i.e.:

$$Y \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)\right) \qquad (1)$$
Each occurrence of any other item k' is recorded by M_i[H(k,i)] with probability 1/m, since all m counters have the same chance of recording k'. Let n(M) denote the total number of occurrences of all other keys; then Z follows the binomial distribution B(n(M), 1/m). Because Ω(M,K) is much smaller than n(M), Z likewise admits a Gaussian approximation, i.e.:

$$Z \sim \mathrm{Norm}\!\left(\frac{n(M)}{m},\ \frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (2)$$
Take the random variable X = Y + Z; the observed value M_i[H(k,i)] is one sample of X. Since the sum of two independent Gaussian variables is again Gaussian, X is also Gaussian, which yields (3):

$$X \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l}+\frac{n(M)}{m},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)+\frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (3)$$
The estimate of Ω(M,K) is obtained by this method; its accuracy is analyzed later using these variables. The theoretical value of Ω(M,K) is as follows:

$$E(X)=\frac{\Omega(M,K)}{l}+\frac{n(M)}{m}, \qquad \Omega(M,K)=\left(E(X)-\frac{n(M)}{m}\right)l \qquad (4)$$
E(X) is a theoretical value and can be replaced by the observed value of M_i[H(k,i)], which yields the estimate of Ω(M,K), denoted Ω_i', i.e.:

$$\Omega_i'=\left(M_i[H(k,i)]-\frac{n(M)}{m}\right)l, \qquad 1 \le i \le l \qquad (5)$$
Taking the average of the above l theoretical values gives Ω(M,K); likewise, the average of the l estimates, denoted Ω(M,K)', is:

$$\Omega(M,K)'=\frac{1}{l}\sum_{i=0}^{l-1}\Omega_i' \qquad (6)$$
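Equations (5) and (6) translate directly into code (a sketch; the argument names and the shape of the hash H are our assumptions):

```python
def estimate_omega(M, k, n_M, l, H):
    """Estimate Ω(M,K) from one count group, per equations (5)-(6).

    M   : list of l groups of counters (one count group)
    n_M : total occurrences of all other keys in M, n(M) in the text
    l   : number of counter groups
    H   : the deterministic hash H(k, i) used by the Map stage
    """
    m = sum(len(g) for g in M)  # total number of counters in the group
    # one estimate per representative counter, eq. (5)
    estimates = [(M[i][H(k, i)] - n_M / m) * l for i in range(l)]
    # their average is the group estimate Ω(M,K)', eq. (6)
    return sum(estimates) / l
```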
Recall that k's occurrences are recorded by t count groups. Denoting k's total occurrence count by Ω(K), we obtain:

$$\Omega(K)=\sum_{M\in\Pi}\Omega(M,k) \qquad (7)$$

The estimate of k's total occurrence count is denoted Ω(K)', i.e.:

$$\Omega(K)'=\sum_{M\in\Pi}\Omega(M,k)' \qquad (8)$$
Because the M_i[H(k,i)] (0 ≤ i < l) are independent samples of X as distributed in (3), it follows from (5) that the Ω_i' (1 ≤ i ≤ l) are independent random variables with the Gaussian distribution:

$$\Omega_i' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)(l-1)+\frac{n(M)\,l^2}{m}\left(1-\frac{1}{m}\right)\right)$$
From (6), the distribution of Ω(M,K)' is:

$$\Omega(M,K)' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
Therefore, the estimate Ω(K)' of k's total occurrence count satisfies the distribution:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\sum_{M\in\Pi}\Omega(M,K),\ \sum_{M\in\Pi}\left(\Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)\right)$$
Applying (7), we have:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\Omega(K),\ \Omega(K)\left(1-\frac{1}{l}\right)+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
Finally, the mean and variance of the estimate are obtained as follows:

$$E(\Omega(K)')=\Omega(K), \qquad \mathrm{Var}(\Omega(K)')=\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right) \qquad (9)$$
Therefore, for confidence level 1−α there exists z_{1−α/2} such that the confidence interval is:

$$\Omega(K)\pm z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)} \qquad (10)$$
Consider a random key k whose true count recorded by all counters in Π is Ω(M,K). Given confidence level 1−α, the count estimate Ω(M,K)' should fall within the range prescribed by 1−α; the accuracy requirement is therefore:

$$\mathrm{Prob}\{\Omega(M,K)'\in[\Omega(M,K)-v,\ \Omega(M,K)+v]\}>1-\alpha \qquad (11)$$
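The interval in (10) can be evaluated numerically; a sketch, assuming the argument names below (`statistics.NormalDist` supplies the quantile z_{1−α/2}):

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(omega_k, n_per_group, m, l, alpha):
    """Confidence interval for the total count Ω(K), per equations (9)-(10).

    omega_k     : Ω(K), the (true or estimated) total count of item k
    n_per_group : list of n(M) for every count group M in Π
    m, l        : counters per VM and number of counter groups
    alpha       : 1 - alpha is the confidence level
    """
    var = omega_k * (l - 1) / l + sum(n_per_group) * l / m * (1 - 1 / m)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    half_width = z * sqrt(var)
    return omega_k - half_width, omega_k + half_width
```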
The goal of the invention is to minimize the system's memory requirement while meeting the target accuracy, i.e., to minimize the number of counters, denoted m, the number of counters per virtual machine.
From equation (10), it follows that:

$$z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)}<v \qquad (12)$$
Solving (12) yields the minimum number of counters, which satisfies relation (13):

$$m>\frac{1+\sqrt{1-4w/l}}{2w/l}, \qquad w=\frac{\left(v/z_{1-\alpha/2}\right)^2-c\,\frac{l-1}{l}}{\sum_{M\in\Pi}n(M)} \qquad (13)$$

where l is the number of counter groups per virtual machine, α the confidence parameter, v the standard error of the count result, c the count threshold, Σ_{M∈Π} n(M) the size of task W, and m the number of counters per virtual machine.
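Relation (13) gives the minimum counter count directly; a sketch under the assumption that the accuracy target is attainable (w > 0 and 4w/l ≤ 1), with argument names ours:

```python
from math import sqrt
from statistics import NormalDist

def min_counters(v, alpha, c, l, task_size):
    """Minimum number of counters m per VM satisfying relation (13).

    v         : target standard error of the count result
    alpha     : 1 - alpha is the confidence level
    c         : count threshold
    l         : number of counter groups per VM
    task_size : sum of n(M) over all count groups, the size of task W
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    w = ((v / z) ** 2 - c * (l - 1) / l) / task_size
    assert w > 0 and 4 * w / l <= 1, "accuracy target not attainable with this l"
    return (1 + sqrt(1 - 4 * w / l)) / (2 * w / l)
```

With settings like those of Fig. 3 (α = 0.05, c = 10000, v = 500, l = 1024, task size 10^10), this yields on the order of 10^8 counters per VM.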
The beneficial effect of the invention is that, at a small performance cost, it obtains approximate processing results whose deviation is both bounded and controllable, preserving practicality while improving MapReduce processing speed. These are the advantages and novelties of the invention.
Brief description of the drawings
Fig. 1 is the structure of the approximate-counting MapReduce model used in the invention;
Fig. 2 is the counter layout of the virtual machines used in the invention;
Fig. 3 shows the relation between memory requirement (GB) and standard error v for classical MP and ACMP;
Fig. 4 shows the relation between memory requirement (GB) and count threshold c for classical MP and ACMP;
Fig. 5 shows the relation between memory requirement (GB) and confidence level for classical MP and ACMP;
Fig. 6 shows the relation between memory requirement (GB) and total count TM for classical MP and ACMP;
Fig. 7 shows the relation between memory requirement (GB) and the parameter l for classical MP and ACMP.
Embodiment
The invention is described in further detail below with reference to the drawings:
As shown in Fig. 1, the approximate-counting MapReduce model comprises two parts, the Map stage and the Reduce stage. In the Map stage, task W is divided into s parts and handed to the Map virtual machines. After a Map virtual machine has processed its assigned subtask, it forms an intermediate file and sends it to a Reduce virtual machine. The Map virtual machines are divided into t groups of s/t machines each. The results of the virtual machines in each group are gathered and added according to the same index, yielding an intermediate result; the number of count values in the intermediate result equals that of any single Map virtual machine. There are t Reduce virtual machines, one per group of Map virtual machines. After receiving the t groups of intermediate results, the Reduce virtual machines send them to the master node. For a specific item count, the master node operates on the t received count groups and derives the count estimate of the item.
Fig. 2 shows the counter layout of a virtual machine used in the invention. As shown, each virtual machine has m counters, divided into l groups, so each group has m/l counters.
Figs. 3 to 7 present a series of experiments verifying the memory requirement of approximate-counting MapReduce. According to formula (13), ACMP's memory consumption depends on five parameters: α (confidence parameter), c (count threshold), v (standard error), TM (item-set size), and l (number of counter groups). Assume the total number of occurrences of all keys is 10 billion. Then, storing purchase relations at 60 bits each and counters at 20 bits each, the memory requirement reaches 10^10 × (60+20)/8 B = 100GB, of which the purchase relations take 75GB and the counters 25GB. The present scheme reduces only the memory occupied by counters; the purchase relations can be kept in a buffer, and only the counters reside uniformly in memory.
Fig. 3 compares the memory requirements of the two schemes as a function of the standard error v, with the confidence level set to 95% and the count threshold to 10000. As the standard error decreases, ACMP's memory demand increases: higher counting accuracy costs more memory. With the standard error at 2000 the memory requirement is 75.02GB; at 250 it rises to 79.08GB, a 4× increase in counter memory. Compared with classical MP, as v rises from 250 to 2000, ACMP reduces the total memory demand by 21.9% to 25.0% and the counters' memory demand by 87.7% to 99.2%. Therefore, when the accuracy requirement is not stringent, ACMP greatly reduces memory use compared with classical MP.
Fig. 4 compares the memory requirements of the two schemes as a function of the support threshold c, with the confidence level set to 95% and standard errors of 500 and 1000. As the threshold increases, the curve for standard error 1000 grows more slowly than that for 500, and the gap between them widens exponentially. The reason is that when the count threshold is large the standard deviation is also large, so increasing the count threshold barely affects the result, and the memory requirement grows little. For the same support threshold, the smaller the standard error, the larger the required memory: for example, at count threshold 10,000 the counter memory needed by ACMP at standard error 500 is 4.6 times that at 1000, because higher accuracy requires more counters and hence consumes more memory.
Fig. 5 compares the two schemes by confidence level, set in turn to 75%, 85%, 90%, 92.5%, 95%, 97.5% and 99%, with the count threshold at 10,000 and the standard error at 500. As the confidence level rises, ACMP's memory demand grows exponentially. Therefore, when the required counting accuracy is modest, raising the accuracy of the estimate by adding memory is feasible.
Fig. 6 compares the memory requirements of the two schemes by item-set size, varied from 10 billion down to 0.5 billion, with the count threshold at 10000 and the standard error at 500. As the item set shrinks, the memory demand of both schemes decreases linearly. The item-set size, however, is determined by the database and needs no tuning. Note that in approximate-counting MapReduce the average noise grows with the item-set size, so noise can be reduced by adding memory space.
Fig. 7 compares the memory requirements of the two schemes by the parameter l, set in turn to 1024, 2048, 4096, 8192, 12288 and 16384, with the count threshold at 10,000 and the standard error at 500. The memory space required by ACMP grows linearly with l; the parameter l should therefore not be set too large.

Claims (2)

1. A MapReduce-based big data approximate processing method, proposing an approximate-counting MapReduce model in which each item corresponds to l counters and each counter counts multiple items; since the number of items is far larger than the number of counters, the counters' occupancy of memory space and the communication cost are reduced; characterized in that, when counting electronic transaction data, items and counters are not in one-to-one correspondence, the number of counters being far smaller than the number of items; in the processing of intermediate files, the Map virtual machines are grouped to reduce communication cost; in terms of result accuracy, the count results are not exact, but their confidence interval and standard error are controllable; and in memory usage, the use of memory space is greatly reduced by reducing the number of counters;
in the Map stage there are s virtual machines in total and in the Reduce stage t virtual machines, where by default s/t must be an integer and s is much larger than t;
in the Map stage the virtual machines are divided into t groups of s/t machines each, each group corresponding to one Reduce-stage virtual machine; when the Map stage ends, the Map-stage virtual machines send all their values to the corresponding Reduce virtual machines;
the Map and Reduce stages together comprise s+t virtual machines, all identically configured: each virtual machine has m counters, divided into l groups, so each group has m/l counters, where by default m/l must be an integer; the i-th counter group of a virtual machine is denoted M_i, 0 ≤ i < l, and the j-th counter of group i is denoted M_i[j];
for any item k, on each occurrence of k the VM generates a random number r, 0 ≤ r < l, and evaluates the hash function h = H(k, r), 0 ≤ h < m/l, then increments M_r[h] by 1; because r is random, each occurrence of k may be recorded by any one of the l counters M_0[H(k,0)], M_1[H(k,1)], …, M_{l−1}[H(k,l−1)], called k's representative counters; occurrences of k are counted only by k's representative counters, and in every virtual machine the indices of k's representative counters are identical, so the count values of different virtual machines are added index-wise.
2. The MapReduce-based big data approximate processing method according to claim 1, characterized in that the count value of k is calculated as follows:
take any count group in Π and label it M; M has m count values, divided into l groups of m/l values each; since the indices of k's representative counters are fixed, M also contains k's l representative counters, denoted M_i[H(k,i)] (0 ≤ i < l); the count value obtained for k is not the true value of k but a reasonable estimate; let Ω(M,K) denote the true count of k recorded in M, and let each representative counter M_i[H(k,i)] yield one estimate, denoted Ω_i';
M_i[H(k,i)] records occurrences of k together with occurrences of other items; within M_i[H(k,i)], the random variable Y denotes the number of occurrences of k and the random variable Z the number of occurrences of other items; each occurrence of k is recorded by counter M_i[H(k,i)] with probability 1/l, since there are l representative counters, so Y follows the binomial distribution B(Ω(M,K), 1/l); when Ω(M,K) is sufficiently large, Y can be approximated by a normal distribution, i.e.:

$$Y \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)\right) \qquad (1)$$
each occurrence of any other item k' is recorded by M_i[H(k,i)] with probability 1/m, since all m counters have the same chance of recording k'; let n(M) denote the total number of occurrences of all other keys; then Z follows the binomial distribution B(n(M), 1/m); because Ω(M,K) is much smaller than n(M), Z likewise admits a Gaussian approximation, i.e.:

$$Z \sim \mathrm{Norm}\!\left(\frac{n(M)}{m},\ \frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (2)$$
take the random variable X = Y + Z; the observed value M_i[H(k,i)] is one sample of X; since the sum of two independent Gaussian variables is again Gaussian, X is also Gaussian, which yields (3):

$$X \sim \mathrm{Norm}\!\left(\frac{\Omega(M,K)}{l}+\frac{n(M)}{m},\ \frac{\Omega(M,K)}{l}\left(1-\frac{1}{l}\right)+\frac{n(M)}{m}\left(1-\frac{1}{m}\right)\right) \qquad (3)$$
the estimate of Ω(M,K) is obtained by this method, and its accuracy is analyzed later; the theoretical value of Ω(M,K) is as follows:

$$E(X)=\frac{\Omega(M,K)}{l}+\frac{n(M)}{m}, \qquad \Omega(M,K)=\left(E(X)-\frac{n(M)}{m}\right)l \qquad (4)$$
E(X) is a theoretical value and can be replaced by the observed value of M_i[H(k,i)], which yields the estimate of Ω(M,K), denoted Ω_i', i.e.:

$$\Omega_i'=\left(M_i[H(k,i)]-\frac{n(M)}{m}\right)l, \qquad 1 \le i \le l \qquad (5)$$
taking the average of the above l theoretical values gives Ω(M,K); likewise, the average of the l estimates, denoted Ω(M,K)', is:

$$\Omega(M,K)'=\frac{1}{l}\sum_{i=0}^{l-1}\Omega_i' \qquad (6)$$
k's occurrences are recorded by t count groups; denoting k's total occurrence count by Ω(K), we obtain:

$$\Omega(K)=\sum_{M\in\Pi}\Omega(M,k) \qquad (7)$$

the estimate of k's total occurrence count is denoted Ω(K)', i.e.:

$$\Omega(K)'=\sum_{M\in\Pi}\Omega(M,k)' \qquad (8)$$
because the M_i[H(k,i)] (0 ≤ i < l) are independent samples of X as distributed in formula (3), it follows from (5) that the Ω_i' (1 ≤ i ≤ l) are independent random variables with the Gaussian distribution:

$$\Omega_i' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)(l-1)+\frac{n(M)\,l^2}{m}\left(1-\frac{1}{m}\right)\right)$$
from (6), the distribution of Ω(M,K)' is:

$$\Omega(M,K)' \sim \mathrm{Norm}\!\left(\Omega(M,K),\ \Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
therefore, the estimate Ω(K)' of k's total occurrence count satisfies the distribution:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\sum_{M\in\Pi}\Omega(M,K),\ \sum_{M\in\Pi}\left(\Omega(M,K)\left(1-\frac{1}{l}\right)+\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)\right)$$
applying the above to formula (7) gives:

$$\Omega(K)' \sim \mathrm{Norm}\!\left(\Omega(K),\ \Omega(K)\left(1-\frac{1}{l}\right)+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)\right)$$
finally, the mean and variance of the estimate are obtained as follows:

$$E(\Omega(K)')=\Omega(K), \qquad \mathrm{Var}(\Omega(K)')=\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right) \qquad (9)$$
therefore, for confidence level 1−α there exists z_{1−α/2} such that the confidence interval is:

$$\Omega(K)\pm z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)} \qquad (10)$$
consider a random key k whose true count recorded by all counters in Π is Ω(M,K); given confidence level 1−α, the count estimate Ω(M,K)' should fall within the range prescribed by 1−α; the accuracy requirement is therefore:

$$\mathrm{Prob}\{\Omega(M,K)'\in[\Omega(M,K)-v,\ \Omega(M,K)+v]\}>1-\alpha \qquad (11)$$
the goal of the invention is to minimize the system's memory requirement while meeting the target accuracy, i.e., to minimize the number of counters, denoted m, the number of counters per virtual machine; from equation (10), it follows that:

$$z_{1-\alpha/2}\sqrt{\Omega(K)\frac{l-1}{l}+\sum_{M\in\Pi}\frac{n(M)\,l}{m}\left(1-\frac{1}{m}\right)}<v \qquad (12)$$
solving formula (12) yields the minimum number of counters, which satisfies relation (13):

$$m>\frac{1+\sqrt{1-4w/l}}{2w/l}, \qquad w=\frac{\left(v/z_{1-\alpha/2}\right)^2-c\,\frac{l-1}{l}}{\sum_{M\in\Pi}n(M)} \qquad (13)$$

where l is the number of counter groups per virtual machine, α the confidence parameter, v the standard error of the count result, c the count threshold, Σ_{M∈Π} n(M) the size of task W, and m the number of counters per virtual machine.
CN201710230053.6A 2017-04-10 2017-04-10 MapReduce-based big data approximate processing method Active CN106997303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710230053.6A CN106997303B (en) 2017-04-10 2017-04-10 MapReduce-based big data approximate processing method

Publications (2)

Publication Number Publication Date
CN106997303A true CN106997303A (en) 2017-08-01
CN106997303B CN106997303B (en) 2020-07-17

Family

ID=59435113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710230053.6A Active CN106997303B (en) 2017-04-10 2017-04-10 MapReduce-based big data approximate processing method

Country Status (1)

Country Link
CN (1) CN106997303B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436954A (en) * 2017-08-16 2017-12-05 吉林大学 A kind of online flow data approximate processing method of quality control and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609347A (en) * 2012-02-17 2012-07-25 江苏南开之星软件技术有限公司 Method for detecting load hotspots in virtual environment
US20120311581A1 (en) * 2011-05-31 2012-12-06 International Business Machines Corporation Adaptive parallel data processing
US20130185729A1 (en) * 2012-01-13 2013-07-18 Rutgers, The State University Of New Jersey Accelerating resource allocation in virtualized environments using workload classes and/or workload signatures
CN103544258A (en) * 2013-10-16 2014-01-29 国家计算机网络与信息安全管理中心 Cardinal number estimating method and cardinal number estimating device under multi-section query condition of big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI YAN et al.: "Scalable Load Balancing for MapReduce-based Record Linkage", 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC) *
TONG Weiqin et al.: "Data-Intensive Computing and Models" (数据密集型计算和模型), 31 January 2015 *


Also Published As

Publication number Publication date
CN106997303B (en) 2020-07-17

Similar Documents

Publication Publication Date Title
US11853283B2 (en) Dynamic aggregate generation and updating for high performance querying of large datasets
Kalodner et al. {BlockSci}: Design and applications of a blockchain analysis platform
CN102915347B (en) A kind of distributed traffic clustering method and system
Lee et al. Large-scale incremental processing with MapReduce
CN107766402A (en) A kind of building dictionary cloud source of houses big data platform
US20160217193A1 (en) Database and method for evaluating data therefrom
CN107229995A (en) Realize method, device and computer-readable recording medium that game service amount is estimated
US10929361B2 (en) Rule-based data source selection
Amossen Vertical partitioning of relational OLTP databases using integer programming
CN109408590A (en) Expansion method, device, equipment and the storage medium of distributed data base
CN105117442A (en) Probability based big data query method
CN108205713A (en) A kind of region wind power prediction error distribution determination method and device
CN108416054A (en) Dynamic HDFS copy number calculating methods based on file access temperature
CN105677645A (en) Data sheet comparison method and device
CN106034144A (en) Load-balancing-based virtual asset data storage method
CN106997303A (en) Big data approximate evaluation method based on MapReduce
Wang et al. Gradient scheduling with global momentum for asynchronous federated learning in edge environment
CN110580307B (en) Processing method and device for fast statistics
CN104778088A (en) Method and system for optimizing parallel I/O (input/output) by reducing inter-progress communication expense
CN105631047A (en) Hierarchically-cascaded data processing method and hierarchically-cascaded data processing system
Yang et al. Probabilistic modeling of renewable energy source based on Spark platform with large‐scale sample data
Zhang et al. An optimization algorithm applied to the class integration and test order problem
CN110489460B (en) Optimization method and system for rapid statistics
CN108520053B (en) Big data query method based on data distribution
Venkateswaran et al. Using machine learning for intelligent shard sizing on the cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant