CN107193862A

CN107193862A - Variance-optimal histogram construction method and device based on Spark Streaming

Info

Publication number: CN107193862A
Application number: CN201710212747.7A
Authority: CN
Inventors: 史亮; 王勇; 张鸿; 何慧虹
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2017-04-01
Filing date: 2017-04-01
Publication date: 2017-09-22

Abstract

The present invention relates to a variance-optimal histogram construction method and device based on Spark Streaming. The method includes: sampling streaming data online using Spark Streaming; dynamically constructing a variance-optimal histogram from the online sample data; and dynamically updating the variance-optimal histogram with newly added data, dynamically constructing it again from the newly added data. The technical solution provided by the invention can construct a high-precision approximate variance-optimal histogram in a single pass over the data within a bounded memory space.

Description

Variance-optimal histogram construction method and device based on Spark Streaming

Technical field

The present invention relates to the field of big data computing, and in particular to a variance-optimal histogram construction method and device based on Spark Streaming.

Background technology

With the rapid development of the Internet, the Internet of Things, cloud computing and communication technologies, the explosive growth of data has brought new opportunities and challenges to many industries, and efficient computation over massive data has become a current research focus. Big data computation can be divided into two modes, batch computing and stream computing. Because streaming data is continuous, time-sensitive and bursty, and its distribution is unknown, online processing of streaming data is less mature than batch processing of offline data, while the demand for efficient computation and statistical analysis over streaming data keeps rising as technology develops. In the recent 2016 Tmall "Double 11" event, the peak transaction rate reached 175,000 per second and the peak payment rate reached 120,000 per second in the first 30 minutes, and after 1 hour 57 seconds the turnover had already exceeded the full-day turnover of Double 11 in 2013. China Mobile's statistics on its 2014 user data show that the average number of calls reached 8 million per minute and mobile data traffic reached 33 GB per second; with full 4G coverage and the rapid development of 5G, these figures are still climbing quickly. Streaming data therefore places higher demands on processing technology in many fields, such as real-time and efficient network monitoring, security monitoring and user data analysis.

One common approach in stream computing is to build a synopsis data structure: the massive, high-speed stream is stored in a dedicated synopsis structure that supports fast retrieval of summary information, statistical analysis results and high-precision approximate queries. The variance-optimal histogram is one such synopsis structure with a very wide range of applications. It is a special kind of histogram characterized by a minimal sum of the variances of the data in each bucket. A variance-optimal histogram not only represents the profile of a large data set intuitively and concisely; because its variance is optimal, aggregate queries over any interval of the data set are also highly accurate. It therefore plays a very important role in efficient range aggregate queries over streaming data, in database querying and in related fields.

Traditional algorithms based on dynamic programming must traverse the data set multiple times, so their time and space complexity are high. Another method can build a variance-optimal histogram for an arbitrary data set in sublinear time, but it works only on offline data and cannot build variance-optimal histograms in a streaming environment. For streaming environments with limited memory, academia has also proposed building a variance-optimal histogram from sample data, but that method presupposes that the data distribution is known in advance, so that the continuously arriving stream can be randomly sampled according to the distribution information. In addition, there is a dynamically adjusted approximate variance-optimal histogram method that inserts each newly arrived element into its corresponding bucket and keeps the overall sum of variances near-optimal by splitting and merging buckets. Its advantage is a greatly reduced time complexity for building the histogram; its drawback is that all raw data must be retained in order to compute the variances of the bucket to be split and the buckets to be merged, so it is poorly suited to dynamically building variance-optimal histograms over big streaming data in bounded space. Under distributed computing frameworks, a method has been proposed that uses the MapReduce framework to build approximate variance-optimal histograms over data in probabilistic databases, but it can only compute over offline data and cannot build histograms quickly over online streams. Currently popular stream computing platforms likewise provide no method for computing variance-optimal histograms.

Accordingly, a method is needed that can efficiently build variance-optimal histograms in a distributed streaming environment, to meet the widespread demand for variance-optimal histograms over streaming data.

Summary of the invention

The present invention provides a variance-optimal histogram construction method and device based on Spark Streaming, whose purpose is to build a high-precision approximate variance-optimal histogram in a single pass over the data within a bounded memory space.

The object of the invention is achieved by the following technical solution:

A variance-optimal histogram construction method based on Spark Streaming, the improvement comprising:

sampling streaming data online using Spark Streaming;

dynamically constructing a variance-optimal histogram from the online sample data;

dynamically updating the variance-optimal histogram with newly added data, and dynamically constructing the variance-optimal histogram again from the newly added data.

Preferably, sampling streaming data online using Spark Streaming includes:

Set the time interval of an RDD, use Spark Streaming to convert the streaming data into a DStream structure according to that time interval, call the window() operation on the RDDs in the DStream structure to aggregate the RDDs in the DStream into windows, and then sample the data in each window online;

wherein the time interval of a window is a positive integer multiple of the RDD time interval.
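The windowing step above can be illustrated without a Spark cluster. The following is a minimal sketch in plain Python, not Spark code: it only simulates how consecutive micro-batches (standing in for the RDDs of a DStream) are grouped into windows whose length is an integer multiple of the batch interval. The function name `sliding_windows` and its parameters are hypothetical and chosen for illustration.

```python
def sliding_windows(batches, window_batches, slide_batches):
    """Group consecutive micro-batches into windows.

    batches        -- list of lists, one inner list per batch interval
    window_batches -- window length, counted in batch intervals (>= 1)
    slide_batches  -- slide step, counted in batch intervals (>= 1)

    Expressing both lengths as batch counts keeps every window an
    integer multiple of the batch interval.
    """
    if window_batches < 1 or slide_batches < 1:
        raise ValueError("window and slide must be positive integers")
    windows = []
    for end in range(window_batches, len(batches) + 1, slide_batches):
        # Flatten the batches covered by this window into one list.
        window = [x for batch in batches[end - window_batches:end] for x in batch]
        windows.append(window)
    return windows


if __name__ == "__main__":
    batches = [[1, 2], [3], [4, 5], [6]]
    print(sliding_windows(batches, window_batches=2, slide_batches=2))
```

In Spark Streaming itself the corresponding operation is `window()` on a DStream, whose window length and slide interval must likewise be multiples of the batch interval.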

Further, sampling the data in each window online includes:

Set the number of samples in the window's sample space to K, initialize the sampling threshold T, the variable sum and the variable count to 0, and initialize a min-heap with a maximum of K elements;

Store the data in the window into the window's sample space, and update the sampling threshold T according to the data in the window;

wherein, when the number of samples in the window's sample space reaches K+1, the samples ω_i and ω_j that satisfy the constraint are selected from the sample space, the smaller of ω_i and ω_j is accumulated into the larger, and the smaller is deleted.

Further, the constraint includes: the sum of ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the minimum within the window's sample space;

wherein the sampling cost J of ω_i and ω_j is determined by the following formula:

$$J = \omega_i \cdot \omega_j \left( \frac{1}{3} \cdot (\omega_i + \omega_j) + \sum_{m=i+1}^{j-1} \omega_m \right)$$

In the above formula, ω_i is the i-th sample, ω_j is the j-th sample, and i<j;
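The pair selection and merge can be sketched as follows. This is a hedged illustration in Python, assuming the sampling cost is J = ω_i · ω_j · ((ω_i + ω_j)/3 + Σ_{m=i+1}^{j-1} ω_m) as given in the claims, and assuming the samples are held in a plain list in arrival order. The function names and the exhaustive O(K²) pair search are illustrative choices, not something the patent prescribes; ties are broken in favour of the first pair found.

```python
from itertools import accumulate


def sampling_cost(samples, prefix, i, j):
    """Cost J of merging samples[i] and samples[j], with i < j."""
    between = prefix[j] - prefix[i + 1]  # sum of samples[i+1 .. j-1]
    return samples[i] * samples[j] * ((samples[i] + samples[j]) / 3.0 + between)


def merge_cheapest_pair(samples, threshold):
    """Merge the admissible pair with minimal cost J.

    A pair (i, j) is admissible when samples[i] + samples[j] < threshold.
    Returns a new list; if no pair is admissible, returns an unchanged copy.
    """
    prefix = [0.0] + list(accumulate(samples))  # prefix[t] = sum(samples[:t])
    best = None  # (cost, i, j)
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if samples[i] + samples[j] < threshold:
                cost = sampling_cost(samples, prefix, i, j)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
    if best is None:
        return list(samples)
    _, i, j = best
    merged = list(samples)
    # Accumulate the smaller value onto the larger one, then delete the smaller.
    keep, drop = (j, i) if samples[i] <= samples[j] else (i, j)
    merged[keep] += merged[drop]
    del merged[drop]
    return merged


if __name__ == "__main__":
    print(merge_cheapest_pair([1.0, 5.0, 2.0, 9.0], threshold=4.0))
```

Calling the function once whenever the sample space holds K+1 items brings it back to K items, provided an admissible pair exists.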

Further, updating the sampling threshold T according to the data in the window includes:

Determine whether a data value in the window is less than the sampling threshold T;

If so, add the data value to the variable sum and increment the variable count by 1;

If not, store the data item in the min-heap;

wherein, when the number of data items in the min-heap reaches K, or the value of the heap-top element of the min-heap is less than the current sampling threshold T, the value of the heap-top element is added to the variable sum, the variable count is incremented by 1, the heap-top element is deleted, and the sampling threshold T is updated to sum/count.
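The threshold update lends itself to a short sketch built on Python's standard `heapq` module. The class and attribute names below are hypothetical. One point is an assumption rather than something the text states: the eviction step is repeated in a loop until neither condition (heap size reaching K, heap top below T) holds.

```python
import heapq


class ThresholdSampler:
    """Maintains the sampling threshold T = sum / count with a bounded min-heap."""

    def __init__(self, k):
        self.k = k              # maximum number of elements in the min-heap
        self.threshold = 0.0    # sampling threshold T
        self.total = 0.0        # the variable "sum" in the text
        self.count = 0          # the variable "count" in the text
        self.heap = []          # min-heap of values not below the threshold

    def offer(self, value):
        if value < self.threshold:
            # Small values are folded into the running aggregate only.
            self.total += value
            self.count += 1
        else:
            heapq.heappush(self.heap, value)
        # Assumption: keep evicting while either condition still holds.
        while self.heap and (len(self.heap) >= self.k
                             or self.heap[0] < self.threshold):
            top = heapq.heappop(self.heap)
            self.total += top
            self.count += 1
            self.threshold = self.total / self.count


if __name__ == "__main__":
    sampler = ThresholdSampler(k=3)
    for v in [5, 1, 4, 2, 8]:
        sampler.offer(v)
    print(sampler.threshold, sorted(sampler.heap))
```

Because the heap top is always the smallest retained value, each eviction costs O(log K), and T only changes when an element leaves the heap, as in the description.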

Preferably, dynamically constructing the variance-optimal histogram from the online sample data includes:

Set the maximum number of samples allowed in the in-memory sample space to K_max, and set the number of buckets of the variance-optimal histogram to B according to the time interval;

Build an equi-width histogram with B buckets from the first K_max online-sampled data items to arrive in the in-memory sample space, wherein each online-sampled data item is added, according to its acquisition time, to the histogram bucket corresponding to the time interval that contains it, and the variance of each bucket is determined from the data values in that bucket to build the histogram.
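The initial equi-width build can be sketched as below. This is a minimal illustration under stated assumptions: samples are `(timestamp, value)` pairs, the bucket boundaries divide the observed time range evenly, and the per-bucket variance is the population variance (the text does not say which variance estimator is used). The dictionary layout and the function name are hypothetical.

```python
from statistics import pvariance


def build_equi_width(samples, num_buckets):
    """Build an equi-width (in time) histogram from (timestamp, value) pairs."""
    t_min = min(t for t, _ in samples)
    t_max = max(t for t, _ in samples)
    width = (t_max - t_min) / num_buckets
    if width == 0:
        width = 1.0  # all samples share one timestamp; avoid division by zero
    buckets = [
        {"start": t_min + b * width, "end": t_min + (b + 1) * width, "values": []}
        for b in range(num_buckets)
    ]
    for t, v in samples:
        # Clamp so that t == t_max falls into the last bucket.
        idx = min(int((t - t_min) / width), num_buckets - 1)
        buckets[idx]["values"].append(v)
    for bucket in buckets:
        vals = bucket["values"]
        bucket["variance"] = pvariance(vals) if vals else 0.0
    return buckets


if __name__ == "__main__":
    data = [(0, 1.0), (1, 3.0), (2, 10.0), (3, 10.0)]
    for b in build_equi_width(data, num_buckets=2):
        print(b)
```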

Further, when the number of data items in the in-memory sample space reaches K_max+1, the newly added data item is added, according to its sampling time, to the histogram bucket corresponding to the time interval that contains it, and the variance of that bucket is updated. The bucket with the largest variance in the current variance-optimal histogram is selected as the bucket to be split, and the bucket with the smallest variance together with whichever of its adjacent buckets has the smaller variance are selected as the two buckets to be merged. If the variance of all the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-merge operation is performed; otherwise, no operation is performed;

When the number of data items in the in-memory sample space reaches K_max+1 and the sampling time of the newly added data item does not fall within the time range of any bucket of the variance-optimal histogram, the item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated. The bucket with the largest variance in the current variance-optimal histogram is selected as the bucket to be split, and the bucket with the smallest variance together with whichever of its adjacent buckets has the smaller variance are selected as the two buckets to be merged. If the sum of the data variances in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-merge operation is performed; otherwise, no operation is performed.
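One split-merge step can be sketched as follows. Several details are assumptions because the text leaves them open: buckets are plain lists of values in time order, the variance is the population variance, the comparison uses the variance of the combined data of the two merge candidates, the bucket to be split is cut at its midpoint, and nothing is done when the split candidate is itself one of the merge candidates. The function names are hypothetical.

```python
from statistics import pvariance


def bucket_variance(values):
    """Population variance of a bucket; an empty bucket counts as 0."""
    return pvariance(values) if values else 0.0


def split_merge_step(buckets):
    """Perform at most one split-merge on a list of buckets (lists of values).

    The bucket count is unchanged: one bucket is split in two and two
    adjacent buckets are merged into one.
    """
    if len(buckets) < 3:
        return buckets
    variances = [bucket_variance(b) for b in buckets]
    indices = range(len(buckets))
    split_idx = max(indices, key=variances.__getitem__)
    low_idx = min(indices, key=variances.__getitem__)
    neighbours = [n for n in (low_idx - 1, low_idx + 1) if 0 <= n < len(buckets)]
    nb_idx = min(neighbours, key=variances.__getitem__)
    left, right = sorted((low_idx, nb_idx))
    # Assumption: skip when the candidates overlap or the split is impossible.
    if split_idx in (left, right) or len(buckets[split_idx]) < 2:
        return buckets
    merged = buckets[left] + buckets[right]
    if bucket_variance(merged) >= variances[split_idx]:
        return buckets
    half = len(buckets[split_idx]) // 2  # assumption: split at the midpoint
    result = []
    for idx, bucket in enumerate(buckets):
        if idx == split_idx:
            result.extend([bucket[:half], bucket[half:]])
        elif idx == left:
            result.append(merged)
        elif idx != right:
            result.append(bucket)
    return result


if __name__ == "__main__":
    print(split_merge_step([[1.0, 1.0], [1.0, 2.0],
                            [0.0, 100.0, 0.0, 100.0], [5.0, 5.0]]))
```

Keeping the number of buckets fixed at B is what lets the histogram adapt its boundaries to the data while staying within the memory budget.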

Preferably, dynamically updating the variance-optimal histogram with newly added data includes:

Set a time window for the variance-optimal histogram built in memory; if the sampling time of a newly arrived data item in the in-memory sample space falls beyond that time window, compress the variance-optimal histogram in the time window.

Further, compressing the variance-optimal histogram in the time window includes:

Retain the boundary data of each bucket of the variance-optimal histogram in the time window and delete all data in each bucket other than the boundary data, wherein the boundary data of each bucket includes: the data items corresponding to the minimum and the maximum sampling time in the bucket, and the mean of all data in the bucket.
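The compression of a single bucket can be sketched in a few lines. This is an illustrative Python sketch assuming each bucket holds `(timestamp, value)` pairs; the field names of the returned summary are hypothetical.

```python
def compress_bucket(samples):
    """Reduce a bucket of (timestamp, value) pairs to its boundary data.

    Keeps only the item with the minimum sampling time, the item with the
    maximum sampling time, and the mean of all values in the bucket.
    """
    if not samples:
        raise ValueError("cannot compress an empty bucket")
    first = min(samples, key=lambda s: s[0])
    last = max(samples, key=lambda s: s[0])
    mean = sum(v for _, v in samples) / len(samples)
    return {"first": first, "last": last, "mean": mean}


if __name__ == "__main__":
    print(compress_bucket([(12, 4.0), (10, 2.0), (15, 9.0)]))
```

After compression each bucket occupies constant space regardless of how many samples it held, which is what frees memory for the next time window.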

A variance-optimal histogram construction device based on Spark Streaming, the improvement being that the device includes:

a sampling module, configured to sample streaming data online using Spark Streaming;

a construction module, configured to dynamically construct a variance-optimal histogram from the online sample data;

an update module, configured to dynamically update the variance-optimal histogram with newly added data and to dynamically construct the variance-optimal histogram again from the newly added data.

Preferably, the sampling module includes:

Set the time interval of an RDD, use Spark Streaming to convert the streaming data into a DStream structure according to that time interval, call the window() operation on the RDDs in the DStream structure to aggregate the RDDs in the DStream into windows, and then sample the data in each window online;

Further, sampling the data in each window online includes:

Further, updating the sampling threshold T according to the data in the window includes:

If not, store the data item in the min-heap;

wherein, when the number of data items in the min-heap reaches K, or the value of the heap-top element of the min-heap is less than the current sampling threshold T, the value of the heap-top element is added to the variable sum, the variable count is incremented by 1, the heap-top element is deleted, and the sampling threshold T is updated to sum/count.

Preferably, the construction module includes:

Preferably, the update module includes:

Compared with the closest prior art, the technical solution provided by the present invention has the following advantages:

The technical solution provided by the present invention solves the problem that existing methods cannot efficiently build variance-optimal histograms over massive, high-speed streaming data of unknown distribution, and supports high-precision real-time interval aggregate queries within a bounded memory space. It draws on the high throughput, low latency and fault tolerance of Spark Streaming, and the designed method is added to the Spark Streaming computing platform in the form of operators, which effectively solves the problem of building variance-optimal histograms in a distributed environment. Through new RDD transformation operations under Spark Streaming, online variance-optimal sampling of the data in a given window is achieved in linear time. By dynamically building the variance-optimal histogram in memory, the histogram can be built in bounded space without a complex dynamic programming algorithm. The technical solution provided by the invention thus achieves dynamic maintenance of variance-optimal histograms over high-speed, massive streaming data and supports real-time interval aggregate queries, with the high fault tolerance and low latency of Spark Streaming.

Brief description of the drawings

Fig. 1 is a flowchart of the variance-optimal histogram construction method based on Spark Streaming of the present invention;

Fig. 2 is an overall flowchart of the variance-optimal histogram construction method based on Spark Streaming of the present invention;

Fig. 3 is a flowchart of online variance-optimal sampling of streaming data using Spark Streaming in an embodiment of the present invention;

Fig. 4 is a flowchart of the method for dynamically constructing the variance-optimal histogram in memory in an embodiment of the present invention;

Fig. 5 is a structural diagram of the variance-optimal histogram construction device based on Spark Streaming of the present invention.

Detailed description of the embodiments

The embodiments of the present invention are described in detail below with reference to the accompanying drawings.

To make the purpose, technical solution and advantages of the embodiments of the present invention clearer, the technical solution in the embodiments is described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are evidently some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

With the explosive growth of network data, real-time analysis of streaming data has become a popular research area. A variance-optimal histogram can return high-precision interval aggregation results within bounded space and is both important and widely applied in data statistics and analysis. However, because building a variance-optimal histogram is complex, no existing stream computing system provides an efficient online construction method. The present invention therefore designs a variance-optimal histogram construction method based on Spark Streaming that efficiently builds variance-optimal histograms over out-of-order streaming data of unknown distribution under limited memory. The method was compared with the traditional variance-optimal histogram construction method and with existing approximate construction methods for streaming data; the experimental results show that it can build a high-precision approximate variance-optimal histogram in a single pass over the data within a bounded memory space, solving the problem that existing stream computing systems cannot efficiently build variance-optimal histograms. The main content of the invention has three parts: 1. in the Spark Streaming environment, perform online variance-optimal sampling on the data in the DStream; 2. persist the sample data obtained from each window into memory and use it to build an approximate variance-optimal histogram by dynamically splitting and merging histogram buckets; 3. dynamically maintain the variance-optimal histogram in memory and support real-time interval aggregate queries. As shown in Fig. 1, the method includes:

101. Sample streaming data online using Spark Streaming;

102. Dynamically construct a variance-optimal histogram from the online sample data;

103. Dynamically update the variance-optimal histogram with newly added data, and dynamically construct the variance-optimal histogram again from the newly added data; that is, return to step 102 and dynamically rebuild the variance-optimal histogram from the newly added data.

Specifically, step 101 includes:

Set the time interval of an RDD, use Spark Streaming to convert the streaming data into a DStream structure according to that time interval, call the window() operation on the RDDs in the DStream structure to aggregate the RDDs in the DStream into windows, and then sample the data in each window online;

Sampling the data in each window online includes:

Set the number of samples in the window's sample space to K, initialize the sampling threshold T, the variable sum and the variable count to 0, and initialize a min-heap with a maximum of K elements;

wherein, when the number of samples in the window's sample space reaches K+1, the samples ω_i and ω_j that satisfy the constraint are selected from the sample space, the smaller of ω_i and ω_j is accumulated into the larger, and the smaller is deleted.

The constraint includes: the sum of ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the minimum within the window's sample space;

Updating the sampling threshold T according to the data in the window includes:

If not, store the data item in the min-heap;

Step 102 includes:

Build an equi-width histogram with B buckets from the first K_max online-sampled data items to arrive in the in-memory sample space, wherein each online-sampled data item is added, according to its acquisition time, to the histogram bucket corresponding to the time interval that contains it, and the variance of each bucket is determined from the data values in that bucket to build the histogram.

wherein, when the number of data items in the in-memory sample space reaches K_max+1, the newly added data item is added, according to its sampling time, to the histogram bucket corresponding to the time interval that contains it, and the variance of that bucket is updated. The bucket with the largest variance in the current variance-optimal histogram is selected as the bucket to be split, and the bucket with the smallest variance together with whichever of its adjacent buckets has the smaller variance are selected as the two buckets to be merged. If the variance of all the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-merge operation is performed; if the sum of the data variances in the two buckets to be merged is not less than the variance of the data in the bucket to be split, no operation is performed;

When the number of data items in the in-memory sample space reaches K_max+1 and the sampling time of the newly added data item does not fall within the time range of any bucket of the variance-optimal histogram, the item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated. The bucket with the largest variance in the current variance-optimal histogram is selected as the bucket to be split, and the bucket with the smallest variance together with whichever of its adjacent buckets has the smaller variance are selected as the two buckets to be merged. If the sum of the data variances in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-merge operation is performed; if it is not less than the variance of the data in the bucket to be split, no operation is performed.

In step 103, dynamically updating the variance-optimal histogram with newly added data includes:

Set a time window for the variance-optimal histogram built in memory; if the sampling time of a newly arrived data item in the in-memory sample space falls beyond that time window, compress the variance-optimal histogram in the time window.

wherein compressing the variance-optimal histogram in the time window includes:

Retain the boundary data of each bucket of the variance-optimal histogram in the time window and delete all data in each bucket other than the boundary data, wherein the boundary data of each bucket includes: the data items corresponding to the minimum and the maximum sampling time in the bucket, and the mean of all data in the bucket.

A variance-optimal histogram construction device based on Spark Streaming, as shown in Fig. 5, the device including:

The sampling module includes:

Sampling the data in each window online includes:

Updating the sampling threshold T according to the data in the window includes:

If not, store the data item in the min-heap;

The construction module includes:

The update module includes:

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.

The application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. A variance-optimal histogram construction method based on Spark Streaming, characterized in that the method includes:

sampling streaming data online using Spark Streaming;

dynamically updating the variance-optimal histogram with newly added data, and dynamically constructing the variance-optimal histogram again from the newly added data.

2. The method of claim 1, characterized in that sampling streaming data online using Spark Streaming includes:

Set the time interval of an RDD, use Spark Streaming to convert the streaming data into a DStream structure according to that time interval, call the window() operation on the RDDs in the DStream structure to aggregate the RDDs in the DStream into windows, and then sample the data in each window online;

3. The method of claim 2, characterized in that sampling the data in each window online includes:

Set the number of samples in the window's sample space to K, initialize the sampling threshold T, the variable sum and the variable count to 0, and initialize a min-heap with a maximum of K elements;

wherein, when the number of samples in the window's sample space reaches K+1, the samples ω_i and ω_j that satisfy the constraint are selected from the sample space, the smaller of ω_i and ω_j is accumulated into the larger, and the smaller is deleted;

Preferably, the constraint includes: the sum of ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the minimum within the window's sample space;

$$J = \omega_i \cdot \omega_j \left( \frac{1}{3} \cdot (\omega_i + \omega_j) + \sum_{m=i+1}^{j-1} \omega_m \right)$$

Preferably, updating the sampling threshold T according to the data in the window includes:

If not, store the data item in the min-heap;

wherein, when the number of data items in the min-heap reaches K, or the value of the heap-top element of the min-heap is less than the current sampling threshold T, the value of the heap-top element is added to the variable sum, the variable count is incremented by 1, the heap-top element is deleted, and the sampling threshold T is updated to sum/count.

4. The method of claim 1, characterized in that dynamically constructing the variance optimization histogram according to the on-line sampled data includes:

Setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance optimization histogram to B according to the time interval;

Building an equal-width histogram with B buckets from the first K_max on-line sampled data items to arrive in the in-memory sample space, wherein each on-line sampled data item is added, according to its acquisition time, to the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram;

Preferably, when the number of data items in the in-memory sample space reaches K_max+1, the newly added data item is added, according to its sampling time, to the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged; if the variance of all the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-merge operation is performed; otherwise, no operation is performed;

When the number of data items in the in-memory sample space reaches K_max+1 and the sampling time of the newly added data item does not fall within the time interval of any bucket of the variance optimization histogram, the data item is added to the bucket whose time interval is nearest to its sampling time, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged; if the sum of the data variances of the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-merge operation is performed; otherwise, no operation is performed.
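The construction and split-merge steps of this claim can be sketched as below. The sketch rests on several assumptions not stated in the claim: buckets are dictionaries with hypothetical keys `lo`, `hi` and `items`; population variance is used; the merge test compares the variance of the pooled data of the two buckets against the bucket to be split; and a bucket is split at the time of its median sample, which is only one possible split rule.

```python
from statistics import pvariance


def bucket_variance(bucket):
    """Population variance of a bucket's values (0.0 for fewer than 2 items)."""
    values = [v for _, v in bucket["items"]]
    return pvariance(values) if len(values) > 1 else 0.0


def build_equal_width(samples, t_start, t_end, num_buckets):
    """Place (time, value) samples into num_buckets equal-width time buckets."""
    width = (t_end - t_start) / num_buckets
    buckets = [{"lo": t_start + k * width, "hi": t_start + (k + 1) * width,
                "items": []} for k in range(num_buckets)]
    for t, v in samples:
        k = max(0, min(int((t - t_start) / width), num_buckets - 1))
        buckets[k]["items"].append((t, v))
    return buckets


def add_sample(buckets, t, v):
    """Add a sample to the covering bucket, or to the nearest one otherwise."""
    for b in buckets:
        if b["lo"] <= t < b["hi"]:
            b["items"].append((t, v))
            return
    nearest = min(buckets,
                  key=lambda b: min(abs(t - b["lo"]), abs(t - b["hi"])))
    nearest["items"].append((t, v))


def split_merge(buckets):
    """Run one split-merge step in place; return True if it was performed."""
    if len(buckets) < 3:
        return False
    var = [bucket_variance(b) for b in buckets]
    idx = range(len(buckets))
    s = max(idx, key=var.__getitem__)            # bucket to be split
    m = min(idx, key=var.__getitem__)            # lowest-variance bucket
    neighbours = [k for k in (m - 1, m + 1) if 0 <= k < len(buckets)]
    n = min(neighbours, key=var.__getitem__)     # its lower-variance neighbour
    left, right = sorted((m, n))
    if s in (left, right) or len(buckets[s]["items"]) < 2:
        return False
    merged = {"lo": buckets[left]["lo"], "hi": buckets[right]["hi"],
              "items": buckets[left]["items"] + buckets[right]["items"]}
    if bucket_variance(merged) >= var[s]:
        return False
    items = sorted(buckets[s]["items"])
    half = len(items) // 2
    cut = items[half][0]                         # assumed split point
    first = {"lo": buckets[s]["lo"], "hi": cut, "items": items[:half]}
    second = {"lo": cut, "hi": buckets[s]["hi"], "items": items[half:]}
    result = []
    for k, b in enumerate(buckets):
        if k == left:
            result.append(merged)
        elif k == s:
            result.extend([first, second])
        elif k != right:
            result.append(b)
    buckets[:] = result
    return True
```

One merge and one split are applied together, so the number of buckets stays at B after each step.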

5. The method of claim 1, characterized in that dynamically updating the variance optimization histogram using the newly added data includes:

Setting the time window of the variance optimization histogram built in memory; if the sampling time corresponding to newly arrived data in the in-memory sample space exceeds the time window, compressing the variance optimization histogram in the time window;

Preferably, compressing the variance optimization histogram in the time window includes:

Retaining the boundary data of each bucket of the variance optimization histogram in the time window, and deleting the data in each bucket other than the boundary data, wherein the boundary data of each bucket include: the data items corresponding to the minimum sampling time and the maximum sampling time in the bucket, and the average of all data in the bucket.
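The compression step can be illustrated with a short Python sketch. The function name `compress_bucket`, the `(time, value)` tuple layout, and the returned dictionary keys are hypothetical; only the retained quantities (earliest item, latest item, bucket average) come from the claim.

```python
from statistics import fmean


def compress_bucket(items):
    """Reduce a bucket of (time, value) pairs to its boundary data.

    Keeps the item with the minimum sampling time, the item with the
    maximum sampling time, and the mean of all values; returns None
    for an empty bucket.
    """
    if not items:
        return None
    first = min(items, key=lambda p: p[0])
    last = max(items, key=lambda p: p[0])
    mean = fmean([v for _, v in items])
    return {"first": first, "last": last, "mean": mean}
```

After compression each bucket occupies constant space regardless of how many samples it held, which is what frees the sample space for data from the next time window.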

6. A variance optimization histogram construction device based on Spark Streaming, characterized in that the device includes:

An update module, configured to dynamically update the variance optimization histogram using the newly added data, and to dynamically reconstruct the variance optimization histogram according to the newly added data.

7. The device of claim 6, characterized in that the sampling module includes:

Setting the RDD time interval, converting the stream data into a DStream structure according to the time interval using Spark Streaming, calling the window() operation on the RDDs in the DStream structure to aggregate the RDDs in the DStream structure into windows, and then performing on-line sampling on the data in each window;
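The grouping of per-interval batches into windows can be imitated without Spark by the following plain-Python sketch. It does not call the Spark API: `sliding_windows` is a hypothetical helper, each inner list stands in for one RDD of the DStream, and the window length and slide are counted in batches rather than in seconds.

```python
def sliding_windows(batches, window_len, slide):
    """Yield the flattened contents of each window over a list of micro-batches.

    A window is emitted every `slide` batches and covers at most the last
    `window_len` batches, so early windows may be shorter.
    """
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        yield [x for batch in batches[start:end] for x in batch]
```

Each yielded list corresponds to the data of one window, on which the on-line sampling step would then operate.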

8. The device of claim 7, characterized in that performing on-line sampling on the data in each window includes:

If not, storing the data in the min-heap structure;

Wherein, when the number of data items in the min-heap structure reaches K, or the value of the heap-top element of the min-heap structure is less than the current sampling threshold T, the value of the heap-top element is added to the variable sum, the variable count is simultaneously incremented by 1, the heap-top element is deleted, and the sampling threshold T is updated to sum/count.

9. The device of claim 6, characterized in that the construction module includes:

10. The device of claim 6, characterized in that the update module includes: