CN106997303A - Big data approximate evaluation method based on MapReduce - Google Patents
- Publication number
- CN106997303A CN106997303A CN201710230053.6A CN201710230053A CN106997303A CN 106997303 A CN106997303 A CN 106997303A CN 201710230053 A CN201710230053 A CN 201710230053A CN 106997303 A CN106997303 A CN 106997303A
- Authority
- CN
- China
- Prior art keywords
- counter
- omega
- value
- virtual machine
- designated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45583—Memory management, e.g. access or allocation
Abstract
The present invention relates to a MapReduce-based approximate processing method for big data. In the proposed approximate-counting MapReduce programming model, items share counters, which reduces memory space and communication cost. At a small performance cost, the invention obtains an approximate result with a bounded deviation, thereby improving the processing speed of MapReduce. On the basis of approximate counting, the invention computes the confidence level and standard deviation of the estimate; the accuracy improves as available memory and bandwidth grow. In addition, for a given accuracy target, the minimum required memory space can be determined.
Description
Technical field
The present invention relates to the approximate processing of big data. Based on the MapReduce programming model, it obtains, at a small performance cost, an approximate result with a bounded deviation, thereby improving the processing speed of MapReduce.
Background technology
In the big data era, data arrive in unprecedented volume, at remarkable speed, and in diverse structures, which makes large-scale data analysis increasingly challenging. To meet scalability requirements, big data are currently processed in parallel, mainly on server clusters.
Google formally introduced the MapReduce programming model through its papers on GFS, MapReduce and BigTable, beginning in 2003. Owing to the simplicity of its programming model, MapReduce has become the most popular framework for server clusters that process big data: even developers without any experience in parallel and distributed programming can use it to solve cumbersome tasks with distributed programs. In addition, the open-source Apache Hadoop implementation of MapReduce has contributed greatly to its wide adoption in industry and academia; Google alone has more than 7,000 applications built on MapReduce.
MapReduce is widely deployed to mine frequent itemsets and the association rules among them in large data sets. When the data set is large, however, the memory space and communication cost required for normal MapReduce operation are also enormous.
The present invention takes the mining of an electronic transaction data set as an example. Suppose the data set contains 10 billion transaction relationships, with 5,000 servers performing Map tasks and 200 servers performing Reduce tasks, and suppose each server maintains counters for all 10 billion relationships. If storing each transaction relationship takes 60 bits and each counter takes 20 bits, then each server needs at least 100 GB of memory: 75 GB for the transaction relationships and 25 GB for the counters. The Map stage then consumes 500 TB and the Reduce stage 20 TB, for 520 TB in total. Such a memory requirement imposes a huge operating cost on the server cluster.
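The memory figures above can be reproduced with a short back-of-the-envelope calculation. The one assumption (consistent with the stated 75 GB, 25 GB and 520 TB totals) is that the "60" and "20" are bits per transaction record and per counter:

```python
# Reproduce the memory estimate for the classical (exact) counting scheme.
# Assumption: 60 bits per transaction relationship, 20 bits per counter.

ITEMS = 10_000_000_000          # 10 billion distinct transaction relationships
RECORD_BITS = 60                # storage per transaction relationship
COUNTER_BITS = 20               # storage per counter
GB = 8 * 10**9                  # bits per (decimal) gigabyte

record_gb = ITEMS * RECORD_BITS / GB     # storage for the relationships
counter_gb = ITEMS * COUNTER_BITS / GB   # storage for the counters
per_server_gb = record_gb + counter_gb   # memory needed on one server

map_total_tb = 5000 * per_server_gb / 1000     # 5,000 Map servers
reduce_total_tb = 200 * per_server_gb / 1000   # 200 Reduce servers

print(record_gb, counter_gb, map_total_tb + reduce_total_tb)  # 75.0 25.0 520.0
```

This confirms the internal consistency of the example: 75 GB + 25 GB per server, 500 TB + 20 TB across the cluster.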
A memory requirement this large is hard to satisfy, especially since the operating system and the data-processing software also need memory. Most of the transaction relationships and their counters could, of course, be pushed into virtual memory (swap) to reduce the resident footprint, but the random access pattern of swap would drastically reduce processing speed, and constantly fetching data from swap would add I/O load; this is therefore not an effective solution.
For many practical problems, an exact result is not particularly important. Suppose the average count of a transaction is 100. When the count of some transaction rises to several thousand, the retailer learns that this transaction relationship is important and can target it in recommendations, regardless of whether the exact figure is 5,000 or 5,500. Likewise, if the count is very small, the item will not be recommended, whether the count is 1 or 50.
Based on this observation, the present invention proposes a MapReduce model with approximate counting: at a small performance cost, it obtains an approximate result with a bounded deviation, thereby improving the processing speed of MapReduce.
Summary of the invention
The present invention builds on MapReduce: by counting approximately, it trades a small, controlled deviation for higher processing speed. The invention is mainly aimed at large-scale counting.
For large-scale counting, the classical MapReduce approach assigns one counter to each item, identified by the item's key. The benefit is an exact result, but the memory space and communication cost are too large. In the proposed approximate-counting MapReduce model (ACMP), each item corresponds to l counters, and each counter may count multiple items. Since the number of items is much larger than the number of counters, the counters' memory occupancy and the communication cost are both reduced.
There are s virtual machines in the Map stage and t virtual machines in the Reduce stage. The invention requires by default that s/t be an integer and that s be much larger than t. In the Map stage the virtual machines are divided into t groups of s/t machines each, with each group corresponding to one virtual machine of the Reduce stage; when the Map stage finishes, each Map virtual machine sends all of its values to the corresponding Reduce virtual machine.
The s+t virtual machines of the Map and Reduce stages are configured identically: each has m counters, divided into l groups of m/l counters each. The invention requires by default that m/l be an integer. The i-th counter group of a virtual machine is denoted M_i (0 ≤ i < l), and the j-th counter of group i is denoted M_i[j].
For any item k, on each occurrence of k the virtual machine generates a random number r (0 ≤ r < l) and evaluates the hash function h = H(k, r), where 0 ≤ h < m/l; the counter M_r[h] is then incremented by 1. Because r is generated at random, each occurrence of k is recorded by one of the following l counters: M_0[H(k,0)], M_1[H(k,1)], …, M_{l−1}[H(k,l−1)]. These counters are called the representative counters of k. Note that within any one virtual machine the indices of the representative counters of k are fixed: only the representative counters of k participate in counting k, and since those indices are identical on every virtual machine, the count values of different virtual machines can be added counter-index by counter-index.
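The Map-side update just described can be sketched as follows. This is a minimal single-VM sketch; the function name `H` follows the text, but its implementation via Python's built-in `hash` is a stand-in, since the patent does not specify a concrete hash function:

```python
import random

def make_counters(m: int, l: int):
    """m counters per VM, arranged as l groups of m/l counters (m % l == 0)."""
    assert m % l == 0
    return [[0] * (m // l) for _ in range(l)]

def H(k, r: int, group_size: int) -> int:
    # Stand-in hash: any deterministic map from (k, r) into a group of m/l
    # counters works; the patent's concrete H is not specified.
    return hash((k, r)) % group_size

def map_record(counters, k, rng=random):
    """Record one occurrence of item k: choose a random group r (0 <= r < l),
    then increment the fixed representative counter M_r[H(k, r)]."""
    l = len(counters)
    r = rng.randrange(l)
    counters[r][H(k, r, len(counters[0]))] += 1
```

For a fixed item k, only the l representative counters M_r[H(k,r)] are ever touched, which is what allows count values from different virtual machines to be merged by counter index.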
After the Map stage finishes, every Map virtual machine sends its recorded values to the corresponding Reduce virtual machine for the next processing step. Each virtual machine holds m counters and therefore m count values, which together form one group of count values. As described above, each Reduce virtual machine corresponds to s/t Map virtual machines, so it receives s/t groups of count values. The Reduce virtual machine then adds together the count values with the same index and uses each sum as the value of the counter with that index; in this way every Reduce virtual machine also produces one group of m count values. The t count-value groups of the Reduce stage are denoted П. Finally, the t groups (i.e. П) are sent to the master node, which computes the count value of each item.
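The Reduce-side merge is just index-wise addition of the s/t count-value groups received from one group of Map virtual machines; a minimal sketch:

```python
def reduce_merge(groups):
    """Reduce-side merge: add the count values that share the same counter
    index across all s/t groups received from one group of Map VMs,
    producing one group of m count values."""
    merged = [0] * len(groups[0])
    for g in groups:                 # one flat list of m values per Map VM
        for j, v in enumerate(g):
            merged[j] += v
    return merged
```

Because the representative-counter indices of any item k are identical on every virtual machine, this index-wise sum preserves the per-item counts.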
Take any item k; the following explains how the master node computes the count value of k.
Take any count-value group in П and denote it M. M holds m count values, divided into l groups of m/l values each. Because the indices of the representative counters of k are fixed, M also contains the l representative counters of k, denoted M_i[H(k,i)] (0 ≤ i < l). The count value of k obtained by the present invention is not the true value of k but a reasonable estimate of it. Let Ω(M,k) denote the true count of k recorded in M; each representative counter M_i[H(k,i)] yields one estimate, denoted Ω_i′.
M_i[H(k,i)] records occurrences of k together with occurrences of other items. Within M_i[H(k,i)], let the random variable Y denote the number of occurrences of k and the random variable Z the number of occurrences of other items. Each occurrence of k is recorded by counter M_i[H(k,i)] with probability 1/l, because there are l representative counters. Therefore Y follows the binomial distribution B(Ω(M,k), 1/l). When Ω(M,k) is sufficiently large, Y can be approximated by a normal distribution, i.e.:
Y ~ N( Ω(M,k)/l , (Ω(M,k)/l)(1 − 1/l) )   (1)
Each occurrence of any other item k′ is recorded by M_i[H(k,i)] with probability 1/m, because all m counters are equally likely to record k′. Let n(M) denote the total number of occurrences of all other keys in M. Then Z follows the binomial distribution B(n(M), 1/m); since Ω(M,k) is much smaller than n(M), and n(M) is large, Z likewise satisfies a normal approximation, i.e.:
Z ~ N( n(M)/m , (n(M)/m)(1 − 1/m) )   (2)
Take the random variable X = Y + Z; the observed value M_i[H(k,i)] is one sample of X. Because the sum of two independent normal distributions is again normal, X is normally distributed, which yields (3):
X ~ N( Ω(M,k)/l + n(M)/m , (Ω(M,k)/l)(1 − 1/l) + (n(M)/m)(1 − 1/m) )   (3)
The estimate of Ω(M,k) is obtained from this distribution; its accuracy is analysed later through the variance. The theoretical (expected) value of X is:
E(X) = Ω(M,k)/l + n(M)/m   (4)
Since E(X) is a theoretical value, it can be replaced by the actually observed value M_i[H(k,i)], from which the estimate of Ω(M,k) contributed by the i-th representative counter, denoted Ω_i′, can be derived, i.e.:
Ω_i′ = l·( M_i[H(k,i)] − n(M)/m )   (5)
The average of the above l theoretical values is Ω(M,k) itself; correspondingly, the average of the l estimates, denoted Ω(M,k)′, is taken as the estimate of Ω(M,k), i.e.:
Ω(M,k)′ = (1/l)·Σ_{0≤i<l} Ω_i′   (6)
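Equations (5) and (6) translate directly into code. A sketch, assuming n(M) — the total count of all other keys — is available to the master node, and using the same stand-in hash convention as on the Map side:

```python
def H(k, r: int, group_size: int) -> int:
    # Stand-in for the Map stage's hash function (the patent's concrete H
    # is unspecified; any deterministic hash of (k, r) works).
    return hash((k, r)) % group_size

def estimate(counters, k, n_M: float) -> float:
    """Estimate Ω(M,k) from one merged counter group, per (5)-(6):
    each representative counter gives Ω_i' = l*(M_i[H(k,i)] - n(M)/m),
    and the final estimate is the average of the l values.
    `counters` is l groups of m/l counts; `n_M` is n(M)."""
    l = len(counters)
    group_size = len(counters[0])
    m = l * group_size
    estimates = [l * (counters[i][H(k, i, group_size)] - n_M / m)
                 for i in range(l)]
    return sum(estimates) / l
```

With perfectly uniform background noise (exactly n(M)/m per counter) the estimator recovers a planted count exactly; on random data it is unbiased, with the variance derived in (9) below.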
Recall that the occurrences of k are recorded by t counting groups. Denoting the total number of occurrences of k by Ω(k), we obtain:
Ω(k) = Σ_{M∈П} Ω(M,k)   (7)
The estimate of the total count of k is denoted Ω(k)′, i.e.:
Ω(k)′ = Σ_{M∈П} Ω(M,k)′   (8)
Because each M_i[H(k,i)] (0 ≤ i < l) is an independent sample of the distribution of X in (3), it follows from (5) that each Ω_i′ is an independent normally distributed random variable:
Ω_i′ ~ N( Ω(M,k) , l²·σ_X² )   (9)
where σ_X² is the variance of X in (3). By (6), the distribution of the average Ω(M,k)′ is:
Ω(M,k)′ ~ N( Ω(M,k) , l·σ_X² )
Therefore the estimate Ω(k)′ of the total count of k satisfies the distribution:
Ω(k)′ ~ N( Ω(k) , Σ_{M∈П} l·σ_X²(M) )
Applying (7), the present invention finally obtains the mean and variance of the estimate:
E(Ω(k)′) = Ω(k),  Var(Ω(k)′) = Σ_{M∈П} l·σ_X²(M)   (10)
Therefore, for confidence level 1 − α there exists the standard normal quantile z_{α/2} such that the confidence interval is:
[ Ω(k)′ − z_{α/2}·√Var(Ω(k)′) , Ω(k)′ + z_{α/2}·√Var(Ω(k)′) ]
Consider a random key k whose true count recorded by all counters in П is Ω(M,k). Given the confidence level 1 − α, the count estimate Ω(M,k)′ should fall within the range prescribed by 1 − α; the accuracy requirement is therefore:
Prob{ Ω(M,k)′ ∈ [Ω(M,k) − v, Ω(M,k) + v] } > 1 − α   (11)
The goal of the present invention is to minimise the memory requirement of the system, i.e. the number of counters per virtual machine, denoted m, subject to the accuracy target. From equation (10), this gives:
z_{α/2}·√( c + l·Σ_{M∈П} n(M)/m ) ≤ v   (12)
Solving (12) yields the minimum number of counters, lower than that of the exact scheme; the relation satisfied is (13):
m ≥ l·Σ_{M∈П} n(M) / ( (v/z_{α/2})² − c )   (13)
where l is the number of counter groups in a virtual machine, α determines the confidence level 1 − α, v is the permitted standard error of the count result, c is the count threshold, Σ_{M∈П} n(M) reflects the size of task W, and m is the number of counters per virtual machine.
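The minimum counter count can then be computed directly. This is a sketch under the assumption that (13) takes the closed form m ≥ l·Σn(M)/((v/z_{α/2})² − c) — recovered from the derivation above, since the printed equation is not legible — and `min_counters` is a hypothetical helper name:

```python
import math
from statistics import NormalDist

def min_counters(l: int, alpha: float, v: float, c: float,
                 total_n: float) -> int:
    """Minimum counters m per VM to meet standard-error target v at
    confidence 1-alpha, count threshold c, and total other-key count
    total_n = sum over П of n(M). Assumed form of (13):
        z_{a/2} * sqrt(c + l*total_n/m) <= v
        =>  m >= l*total_n / ((v/z)^2 - c)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
    slack = (v / z) ** 2 - c
    if slack <= 0:
        raise ValueError("accuracy target v unattainable at threshold c")
    return math.ceil(l * total_n / slack)
```

As the document's experiments suggest, m grows as v shrinks, as the confidence level rises (larger z), and linearly with l and with the task size.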
The beneficial effect of the method is that, at a small performance cost, it obtains an approximate result with a bounded deviation, guarantees that the approximation error is controllable, and thus preserves practicality while improving the processing speed of MapReduce. These are the advantages and innovations of the present invention.
Brief description of the drawings
Fig. 1 is the structure of the approximate-counting MapReduce model used in the present invention;
Fig. 2 is the counter layout of a virtual machine used in the present invention;
Fig. 3 shows the relation between the memory requirement (GB) of classical MP and of ACMP and the standard error v;
Fig. 4 shows the relation between the memory requirement (GB) of classical MP and of ACMP and the count threshold c;
Fig. 5 shows the relation between the memory requirement (GB) of classical MP and of ACMP and the confidence level;
Fig. 6 shows the relation between the memory requirement (GB) of classical MP and of ACMP and the total count TM;
Fig. 7 shows the relation between the memory requirement (GB) of classical MP and of ACMP and the parameter l.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings:
As shown in Fig. 1, the approximate-counting MapReduce model consists of two parts: the Map stage and the Reduce stage. In the Map stage, task W is divided into s parts and handed to the Map virtual machines. After a Map virtual machine has processed its share of the task, it forms an intermediate file and sends it to a Reduce virtual machine. The Map virtual machines are divided into t groups of s/t machines each. The results of the virtual machines within one group are gathered and added index by index to obtain an intermediate result; the number of count values in the intermediate result equals the number of count values in any single Map virtual machine. There are t Reduce virtual machines, one per group of Map virtual machines. After the Reduce virtual machines have received the intermediate results, the t groups of intermediate results are sent to the master node. To count a specific item, the master node operates on the t received groups of count values and derives the count estimate of the item.
Fig. 2 is the counter layout of a virtual machine used in the present invention. As shown, each virtual machine has m counters, divided into l groups of m/l counters each.
Figs. 3 to 7 present a series of experiments carried out to verify the memory requirement of approximate-counting MapReduce. According to formula (13), the memory consumption of ACMP depends on five parameters: α, c, v, TM and l, where α determines the confidence level, c is the count threshold, v is the standard error, TM is the itemset size, and l is the number of counter groups. Suppose all keys together occur 10 billion times. In that case, storing purchase relations at 60 bits each and counters at 20 bits each gives a memory requirement of up to 100 GB, of which the purchase relations take 75 GB and the counters 25 GB. The scheme of the present invention reduces only the memory space occupied by the counters; the purchase relations can be kept in a buffer, and their memory footprint is left unchanged here.
Fig. 3 compares the memory requirements of the two schemes against the standard error v. The confidence level is set to 95% and the count threshold to 10,000. As the standard error decreases, the memory demand of ACMP grows; that is, as the counting accuracy increases, so does the memory requirement of ACMP. With the standard error set to 2,000 the memory requirement is 75.02 GB; with the standard error set to 250 it rises to 79.08 GB, the counter memory growing about fourfold. Compared with classical MP, when v rises from 250 to 2,000, ACMP reduces the total memory demand by 21.9% to 25.0% and the counter memory demand by 87.7% to 99.2%. Therefore, when the accuracy requirement is not strict, ACMP can greatly reduce memory use compared with classical MP.
Fig. 4 compares the memory requirements of the two schemes against the support threshold c. The confidence level is set to 95% and the standard error to 500 and 1,000 respectively. As the threshold increases, the curve for a standard error of 1,000 grows more slowly than the curve for 500, and the gap between them widens exponentially. The reason is that when the count threshold is large the standard deviation is also large, so a further increase of the count threshold has little influence on the result, and the memory requirement grows little. For the same support threshold, the scheme with the smaller standard deviation needs the larger memory space. For example, when the count threshold is 10,000, the memory needed by the ACMP counters for a standard error of 500 is 4.6 times that for a standard error of 1,000, because higher accuracy needs more counters and hence consumes more memory.
Fig. 5 compares the two schemes against the confidence level. The confidence level is set in turn to 75%, 85%, 90%, 92.5%, 95%, 97.5% and 99%, with the count threshold at 10,000 and the standard deviation at 500. As the confidence level rises, the memory demand of ACMP grows exponentially. Therefore, when the required counting accuracy is modest, improving the accuracy of the estimate by enlarging memory is feasible.
Fig. 6 compares the memory requirements of the two schemes against the itemset size. The itemset size is set in turn from 10 billion down to 500 million, with the count threshold at 10,000 and the standard error at 500. As the itemset shrinks, the memory demand of both schemes decreases linearly. The itemset size, however, is determined by the size of the database and need not be tuned. In approximate-counting MapReduce the average noise grows with the itemset, so the noise can be reduced by enlarging the memory space.
Fig. 7 compares the memory requirements of the two schemes against the parameter l. The parameter l is set in turn to 1,024, 2,048, 4,096, 8,192, 12,288 and 16,384, with the count threshold at 10,000 and the standard error at 500. The memory space needed by ACMP grows linearly with l; the value of l should therefore not be set too large.
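Putting the pieces together, a toy end-to-end run shows the estimator at work. This is a single-VM sketch at toy scale; the hash, the parameter values, and the assumption that n(M) is known exactly are all stand-ins:

```python
import random

def acmp_simulate(true_counts, m=1000, l=10, seed=1):
    """Toy single-VM ACMP run: hash every occurrence of each item into one of
    its l representative counters, then recover each item's count with the
    estimator of (5)-(6). Returns {item: estimated count}."""
    rng = random.Random(seed)
    gs = m // l                                  # counters per group (m/l)
    counters = [[0] * gs for _ in range(l)]
    stream = [k for k, c in true_counts.items() for _ in range(c)]
    rng.shuffle(stream)
    for k in stream:                             # Map-stage counting
        r = rng.randrange(l)
        counters[r][hash((k, r)) % gs] += 1
    total = sum(true_counts.values())
    est = {}
    for k in true_counts:                        # master-node estimation
        n_M = total - true_counts[k]             # occurrences of other keys
        vals = [l * (counters[i][hash((k, i)) % gs] - n_M / m)
                for i in range(l)]
        est[k] = sum(vals) / l
    return est
```

On a stream with one heavy hitter among many small items, the heavy hitter's estimate lands close to its true count while using only m counters for many more distinct items, which is the trade-off the invention describes.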
Claims (2)
1. A MapReduce-based big data approximate processing method, in which, in the proposed approximate-counting MapReduce model, each item corresponds to l counters and each counter counts multiple items, the number of items being much larger than the number of counters, thereby reducing the counters' memory occupancy and the communication cost, characterised in that, when electronic transaction data are counted, items and counters are not in one-to-one correspondence and the number of counters is much smaller than the number of items; in the processing of intermediate files, the Map virtual machines are grouped so as to reduce communication cost; as regards the accuracy of the count results, the results are not exact, but their confidence interval and standard error are controllable; and as regards memory use, reducing the number of counters greatly reduces the use of memory space,
wherein there are s virtual machines in the Map stage and t virtual machines in the Reduce stage, s and t being chosen by default such that s/t is an integer and s is much larger than t;
in the Map stage the virtual machines are divided into t groups of s/t machines each, each group corresponding to one virtual machine of the Reduce stage, and when the Map stage finishes, each Map virtual machine sends all of its values to the corresponding Reduce virtual machine;
the s+t virtual machines of the Map and Reduce stages are configured identically, i.e. each virtual machine has m counters divided into l groups of m/l counters each, m and l being chosen by default such that m/l is an integer, the i-th counter group of a virtual machine being denoted M_i (0 ≤ i < l) and the j-th counter of group i being denoted M_i[j];
for any item k, on each occurrence of k the virtual machine generates a random number r (0 ≤ r < l) and evaluates the hash function h = H(k, r), and the counter M_r[h] is incremented by 1; because r is generated at random, each occurrence of k is recorded by one of the l counters M_0[H(k,0)], M_1[H(k,1)], …, M_{l−1}[H(k,l−1)], called the representative counters of k; the count of k is carried only by the representative counters of k, and, since within any virtual machine the indices of the representative counters of k are identical, the count values of different virtual machines are added counter-index by counter-index.
2. The MapReduce-based big data approximate processing method according to claim 1, characterised in that the count value of k is calculated as follows:
take any count-value group in П and denote it M; M has m count values divided into l groups of m/l values each; because the indices of the representative counters of k are fixed, M contains the l representative counters of k, denoted M_i[H(k,i)] (0 ≤ i < l); the count value of k obtained by the present invention is not the true value of k but a reasonable estimate of it; Ω(M,k) denotes the true count of k recorded in M, and each representative counter M_i[H(k,i)] yields one estimate, denoted Ω_i′;
M_i[H(k,i)] records occurrences of k together with occurrences of other items; within M_i[H(k,i)] the random variable Y denotes the number of occurrences of k and the random variable Z the number of occurrences of other items; each occurrence of k is recorded by M_i[H(k,i)] with probability 1/l, because there are l representative counters, so Y follows the binomial distribution B(Ω(M,k), 1/l), and when Ω(M,k) is sufficiently large Y can be approximated by a normal distribution, i.e.:
Y ~ N( Ω(M,k)/l , (Ω(M,k)/l)(1 − 1/l) )   (1)
each occurrence of any other item k′ is recorded by M_i[H(k,i)] with probability 1/m, because all m counters are equally likely to record k′; with n(M) the total number of occurrences of all other keys, Z follows the binomial distribution B(n(M), 1/m), and since Ω(M,k) is much smaller than n(M), Z likewise satisfies a normal approximation, i.e.:
Z ~ N( n(M)/m , (n(M)/m)(1 − 1/m) )   (2)
take the random variable X = Y + Z, of which the observed M_i[H(k,i)] is one sample; because the sum of two independent normal distributions is again normal, X is normally distributed, which yields (3):
X ~ N( Ω(M,k)/l + n(M)/m , (Ω(M,k)/l)(1 − 1/l) + (n(M)/m)(1 − 1/m) )   (3)
the theoretical value of X is
E(X) = Ω(M,k)/l + n(M)/m   (4)
and replacing E(X) by the actually observed value M_i[H(k,i)] yields the estimate Ω_i′, i.e.:
Ω_i′ = l·( M_i[H(k,i)] − n(M)/m )   (5)
the average of the l theoretical values is Ω(M,k), and the average of the l estimates, denoted Ω(M,k)′, is:
Ω(M,k)′ = (1/l)·Σ_{0≤i<l} Ω_i′   (6)
the occurrences of k are recorded by t counting groups; denoting the total count of k by Ω(k):
Ω(k) = Σ_{M∈П} Ω(M,k)   (7)
and the estimate of the total count of k, denoted Ω(k)′, is:
Ω(k)′ = Σ_{M∈П} Ω(M,k)′   (8)
because each M_i[H(k,i)] (0 ≤ i < l) is an independent sample of the distribution of X in formula (3), it follows from (5) that each Ω_i′ is an independent normally distributed random variable:
Ω_i′ ~ N( Ω(M,k) , l²·σ_X² )   (9)
where σ_X² is the variance of X in (3); by (6), Ω(M,k)′ ~ N( Ω(M,k) , l·σ_X² ), and therefore the estimate Ω(k)′ of the total count of k satisfies Ω(k)′ ~ N( Ω(k) , Σ_{M∈П} l·σ_X²(M) );
applying formula (7), the mean and variance of the estimate are obtained:
E(Ω(k)′) = Ω(k),  Var(Ω(k)′) = Σ_{M∈П} l·σ_X²(M)   (10)
hence, for confidence level 1 − α there exists the standard normal quantile z_{α/2} such that the confidence interval is [ Ω(k)′ − z_{α/2}·√Var(Ω(k)′) , Ω(k)′ + z_{α/2}·√Var(Ω(k)′) ];
consider a random key k whose true count recorded by all counters in П is Ω(M,k); given the confidence level 1 − α, the count estimate Ω(M,k)′ should fall within the range prescribed by 1 − α; the accuracy requirement is therefore:
Prob{ Ω(M,k)′ ∈ [Ω(M,k) − v, Ω(M,k) + v] } > 1 − α   (11)
the goal of the present invention is to minimise the memory requirement of the system, i.e. the number of counters m per virtual machine, subject to the accuracy target; from equation (10) this gives:
z_{α/2}·√( c + l·Σ_{M∈П} n(M)/m ) ≤ v   (12)
and solving (12) yields the minimum number of counters, satisfying relation (13):
m ≥ l·Σ_{M∈П} n(M) / ( (v/z_{α/2})² − c )   (13)
where l is the number of counter groups in a virtual machine, α determines the confidence level 1 − α, v is the standard error of the count result, c is the count threshold, Σ_{M∈П} n(M) reflects the size of task W, and m is the number of counters per virtual machine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710230053.6A CN106997303B (en) | 2017-04-10 | 2017-04-10 | MapReduce-based big data approximate processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106997303A true CN106997303A (en) | 2017-08-01 |
CN106997303B CN106997303B (en) | 2020-07-17 |
Family
ID=59435113
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710230053.6A Active CN106997303B (en) | 2017-04-10 | 2017-04-10 | MapReduce-based big data approximate processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106997303B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107436954A (en) * | 2017-08-16 | 2017-12-05 | 吉林大学 | A kind of online flow data approximate processing method of quality control and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102609347A (en) * | 2012-02-17 | 2012-07-25 | 江苏南开之星软件技术有限公司 | Method for detecting load hotspots in virtual environment |
US20120311581A1 (en) * | 2011-05-31 | 2012-12-06 | International Business Machines Corporation | Adaptive parallel data processing |
US20130185729A1 (en) * | 2012-01-13 | 2013-07-18 | Rutgers, The State University Of New Jersey | Accelerating resource allocation in virtualized environments using workload classes and/or workload signatures |
CN103544258A (en) * | 2013-10-16 | 2014-01-29 | 国家计算机网络与信息安全管理中心 | Cardinal number estimating method and cardinal number estimating device under multi-section query condition of big data |
Non-Patent Citations (2)
Title |
---|
WEI YAN et al.: "Scalable Load Balancing for MapReduce-based Record Linkage", 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC) |
TONG Weiqin et al.: Data-Intensive Computing and Models, 31 January 2015 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |