CN107193862A - Variance optimization histogram construction method and device based on Spark Streaming - Google Patents

Publication number
CN107193862A
Authority
CN
China
Prior art keywords
data
variance
bucket
window
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710212747.7A
Other languages
Chinese (zh)
Inventor
史亮
王勇
张鸿
何慧虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201710212747.7A
Publication of CN107193862A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to a variance optimization histogram construction method and device based on Spark Streaming. The method includes: performing online sampling on streaming data using Spark Streaming; dynamically constructing a variance optimization histogram from the online sampled data; and dynamically updating the variance optimization histogram with newly added data, dynamically reconstructing it from that data. The technical solution provided by the invention can build a high-precision approximate variance optimization histogram in a single pass over the data within a limited memory space.

Description

Variance optimization histogram construction method and device based on Spark Streaming
Technical field
The present invention relates to the field of big data computing, and in particular to a variance optimization histogram construction method and device based on Spark Streaming.
Background
With the rapid development of the Internet, the Internet of Things, cloud computing and communication technology, the explosive growth of data has brought new opportunities and challenges to many industries, and efficient computation over massive data has become a focus of current research. Big data computation can be divided into two modes, batch computing and stream computing. Because streaming data is continuous, time-sensitive and bursty, and its distribution is unknown, online processing techniques for streaming data are less mature than batch computation over offline data, while the demand for efficient computation and statistical analysis of streaming data keeps rising as technology develops. During the Tmall "Double 11" event of 2016, the peak transaction rate reached 175,000 per second and the peak payment rate reached 120,000 per second within the first 30 minutes, and the turnover within 1 hour and 57 seconds already exceeded the full-day turnover of Double 11 in 2013. China Mobile's statistics on its 2014 user data showed an average of up to 8,000,000 calls per minute and mobile data traffic of 33 GB per second; with the full coverage of 4G and the rapid development of 5G technology, these figures are still climbing. Streaming data processing technology therefore faces higher requirements for real-time efficiency in fields such as network monitoring, security monitoring and user data analysis.
One common approach in stream computing is to build a synopsis data structure: the massive, rapidly arriving streaming data is stored in a specific synopsis structure that supports fast retrieval of summary information, statistical analysis results and high-precision approximate queries. The variance optimization histogram (commonly called the V-optimal histogram) is one such synopsis data structure with a very wide range of applications. It is a special kind of histogram characterized by a minimal sum of the variances of the data within each bucket. It not only represents the profile of a large data set intuitively and concisely but, because of its optimal variance, also answers aggregate queries over any interval of the data set with high precision. Variance optimization histograms therefore hold a very important position in fields such as efficient range aggregate queries over streaming data and database queries.
The traditional algorithm based on dynamic programming must traverse the data set multiple times, so its time and space complexity are high. Another method can build a variance optimization histogram for an arbitrary data set in sublinear time, but it applies only to offline data and cannot build the histogram in a streaming environment. Given the limited memory space of a streaming environment, academia has also proposed building the variance optimization histogram from sample data, but that method presupposes that the data distribution is known in advance, so that the continuously arriving stream can be randomly sampled according to the known distribution information. In addition, there is a dynamically adjusted approximate variance optimization histogram method: each newly arriving element is inserted into its corresponding bucket, and buckets are split and merged so that the overall variance sum of the histogram is approximately optimal. Its advantage is that it greatly reduces the time complexity of building the histogram; its drawback is that all raw data must be kept in order to compute the variances of the bucket to be split and the buckets to be merged, so it is unsuitable for dynamically building variance optimization histograms in a space-limited streaming big data environment. Finally, under distributed computing frameworks, a method has been proposed that uses the MapReduce framework to build approximate variance optimization histograms over data in probabilistic databases, but it works only on offline data and cannot build histograms quickly over online streaming data, and currently popular stream computing platforms do not provide a method for computing variance optimization histograms.
It is therefore desirable to provide a method that efficiently builds variance optimization histograms in a distributed streaming environment, to meet the widespread demand for variance optimization histograms over streaming data.
Summary of the invention
The present invention provides a variance optimization histogram construction method and device based on Spark Streaming, whose purpose is to build a high-precision approximate variance optimization histogram in a single pass over the data within a limited memory space.
The purpose of the present invention is achieved by the following technical solution:
A variance optimization histogram construction method based on Spark Streaming, the improvement being that the method includes:
performing online sampling on streaming data using Spark Streaming;
dynamically constructing a variance optimization histogram from the online sampled data;
dynamically updating the variance optimization histogram with newly added data, and dynamically reconstructing the variance optimization histogram from the newly added data.
Preferably, performing online sampling on streaming data using Spark Streaming includes:
setting the time interval of an RDD, using Spark Streaming to convert the streaming data into a DStream structure according to that time interval, calling the window() operation on the RDDs in the DStream structure to aggregate them into windows, and then performing online sampling on the data in each window;
wherein the time interval of a window is a positive integer multiple of the RDD time interval.
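The windowing constraint above can be illustrated with a minimal pure-Python sketch that simulates grouping micro-batches (standing in for RDDs) into non-overlapping windows. The function name `group_into_windows` and its parameters are hypothetical and not part of the patent text or of Spark; in Spark Streaming itself the corresponding DStream operation is `window`, whose durations must likewise be multiples of the batch interval.

```python
def group_into_windows(batches, batch_interval, window_interval):
    """Group consecutive micro-batches into non-overlapping windows.

    batches: list of lists, one inner list per micro-batch (stand-in for an RDD).
    batch_interval / window_interval: durations in the same integer unit.
    """
    # The window interval must be a positive integer multiple of the batch interval.
    if window_interval <= 0 or window_interval % batch_interval != 0:
        raise ValueError(
            "window interval must be a positive integer multiple of the batch interval"
        )
    per_window = window_interval // batch_interval
    # Flatten each run of `per_window` batches into a single window.
    return [
        [item for batch in batches[i:i + per_window] for item in batch]
        for i in range(0, len(batches), per_window)
    ]


windows = group_into_windows([[1, 2], [3], [4, 5], [6]], batch_interval=2, window_interval=4)
```

With a batch interval of 2 and a window interval of 4, every two batches form one window; a window interval of 3 would be rejected.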
Further, performing online sampling on the data in each window includes:
setting the number of samples in the window's sample space to K, initializing the sampling threshold T and the variables sum and count to 0, and initializing a min-heap with a maximum of K elements;
storing the data in the window into the window's sample space, and updating the sampling threshold T according to the data in the window;
wherein, when the number of samples in the window's sample space reaches K+1, sample data ω_i and ω_j satisfying a constraint are selected from the window's sample space, the smaller of the two data values is accumulated onto the larger, and the smaller of ω_i and ω_j is deleted.
Further, the constraint includes: the sum of sample data ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the smallest in the window's sample space;
wherein the sampling cost J of sample data ω_i and ω_j is determined by the following formula:
In the above formula, ω_i is the i-th sample data, ω_j is the j-th sample data, and i<j;
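The merge step described above can be sketched as follows. This is only an illustration under stated assumptions: the formula for the sampling cost J is not reproduced in this text, so the sketch takes the cost as a caller-supplied function, and the absolute-difference cost used in the usage lines is a purely hypothetical stand-in. The name `merge_one_pair` is likewise hypothetical.

```python
from itertools import combinations


def merge_one_pair(samples, threshold, cost):
    """Merge one qualifying pair in place; return True if a merge happened.

    samples: list of sample values (the window's sample space).
    threshold: sampling threshold T.
    cost: caller-supplied function cost(w_i, w_j); the patent's formula for J
          is not available here, so this is a placeholder hook.
    """
    best = None
    for i, j in combinations(range(len(samples)), 2):  # guarantees i < j
        if samples[i] + samples[j] < threshold:  # constraint: w_i + w_j < T
            c = cost(samples[i], samples[j])
            if best is None or c < best[0]:
                best = (c, i, j)
    if best is None:
        return False
    _, i, j = best
    small, large = (i, j) if samples[i] <= samples[j] else (j, i)
    samples[large] += samples[small]  # accumulate the smaller value onto the larger
    del samples[small]                # delete the smaller sample
    return True


samples = [4.0, 1.0, 7.0, 2.0]
merged = merge_one_pair(samples, threshold=5.0, cost=lambda a, b: abs(a - b))
```

The merge keeps the total of the sample values unchanged while reducing the sample count by one, which is what lets the sample space fall back to K entries.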
Further, updating the sampling threshold T according to the data in the window includes:
judging whether a data value in the window is less than the sampling threshold T;
if so, adding the data value to the variable sum and incrementing the variable count by 1;
if not, storing the data in the min-heap;
wherein, when the number of elements in the min-heap reaches K, or the value at the top of the min-heap is less than the current sampling threshold T, the heap-top value is added to the variable sum, the variable count is incremented by 1, the heap-top data is deleted, and the sampling threshold T is updated to sum/count.
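The threshold update above can be sketched with Python's standard `heapq` module. This is one reading of the steps, not a definitive implementation: the function name `update_threshold` is hypothetical, and the sketch assumes that the heap condition is re-checked after every arriving value and that T changes only when a heap-top value is removed, as the text states.

```python
import heapq


def update_threshold(values, k):
    """Process a window's values; return (T, sum, count, remaining min-heap)."""
    threshold, total, count = 0.0, 0.0, 0  # T, sum and count all start at 0
    heap = []                              # min-heap holding at most k elements
    for value in values:
        if value < threshold:
            # Small value: fold it into the running sum.
            total += value
            count += 1
        else:
            # Large value: keep it in the min-heap.
            heapq.heappush(heap, value)
        # Evict the heap top while the heap is full or its top is below T.
        while heap and (len(heap) >= k or heap[0] < threshold):
            total += heapq.heappop(heap)
            count += 1
            threshold = total / count      # T is updated to sum/count
    return threshold, total, count, heap


threshold, total, count, heap = update_threshold([5, 3, 8, 1, 9, 2], k=3)
```

Every value ends up either in the running sum or in the heap, so the sum of both always equals the total of the values seen so far.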
Preferably, dynamically constructing the variance optimization histogram from the online sampled data includes:
setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance optimization histogram to B according to the time interval;
building an equi-width histogram with B buckets from the first K_max online-sampled data items to arrive in the in-memory sample space, wherein each online-sampled data item is added, according to its sampling time, to the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram.
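The initial equi-width construction can be sketched as below. The representation of a sample as a `(sampling_time, value)` pair, the dictionary layout of a bucket and the name `build_equal_width` are assumptions made for illustration; the population variance from the standard `statistics` module is used because the text does not say which variance estimator is intended.

```python
from statistics import pvariance


def build_equal_width(samples, num_buckets):
    """Build an equi-width histogram over sampling time.

    samples: list of (sampling_time, value) pairs, e.g. the first K_max samples.
    num_buckets: number of buckets B.
    """
    times = [t for t, _ in samples]
    t_min, t_max = min(times), max(times)
    width = (t_max - t_min) / num_buckets or 1.0  # avoid zero width if all times are equal
    buckets = [
        {"start": t_min + i * width, "end": t_min + (i + 1) * width, "values": []}
        for i in range(num_buckets)
    ]
    for t, v in samples:
        # Place each sample in the bucket whose time interval contains it;
        # the latest sample is clamped into the last bucket.
        idx = min(int((t - t_min) / width), num_buckets - 1)
        buckets[idx]["values"].append(v)
    for b in buckets:
        b["variance"] = pvariance(b["values"]) if b["values"] else 0.0
    return buckets


hist = build_equal_width([(0, 1.0), (1, 3.0), (2, 10.0), (3, 10.0), (4, 2.0)], num_buckets=2)
```

Here the first bucket covers times 0 to 2 and the second covers 2 to 4, and each bucket carries the variance of its own values.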
Further, when the number of data items in the in-memory sample space reaches K_max+1, the newly added item is placed, according to its sampling time, into the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance together with its adjacent bucket of smaller variance are selected as the two buckets to be merged; if the total variance of the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, the split-and-merge operation is performed; otherwise, no operation is performed;
when the number of data items in the in-memory sample space reaches K_max+1 and the sampling time of the newly added item does not fall within the time range of any bucket of the variance optimization histogram, the item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance together with its adjacent bucket of smaller variance are selected as the two buckets to be merged; if the sum of the variances of the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, the split-and-merge operation is performed; otherwise, no operation is performed.
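The selection of split and merge candidates can be sketched as a decision function. Several points are assumptions rather than statements from the text: the merge criterion is read as the sum of the two buckets' variances, buckets are plain lists of values in time order, the decision is skipped when the split candidate is itself one of the merge candidates, and the actual splitting of a bucket is left out because the text does not say where the split point lies. The name `choose_split_merge` is hypothetical.

```python
from statistics import pvariance


def bucket_variance(values):
    """Population variance of a bucket; an empty bucket counts as 0."""
    return pvariance(values) if values else 0.0


def choose_split_merge(buckets):
    """Return (split_index, (merge_left, merge_right)) or None if nothing to do."""
    if len(buckets) < 3:
        return None
    var = [bucket_variance(b) for b in buckets]
    indices = range(len(buckets))
    split_idx = max(indices, key=var.__getitem__)  # largest variance: split candidate
    low = min(indices, key=var.__getitem__)        # smallest variance: merge candidate
    neighbours = [i for i in (low - 1, low + 1) if 0 <= i < len(buckets)]
    mate = min(neighbours, key=var.__getitem__)    # adjacent bucket with smaller variance
    if split_idx in (low, mate):
        return None  # assumption: a bucket cannot be split and merged at once
    if var[low] + var[mate] < var[split_idx]:
        return split_idx, tuple(sorted((low, mate)))
    return None


decision = choose_split_merge([[1.0, 1.0, 1.0], [1.0, 2.0], [0.0, 10.0, 20.0], [5.0, 5.0]])
```

Splitting one bucket while merging two others keeps the bucket count at B, so the histogram stays within its fixed memory budget.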
Preferably, dynamically updating the variance optimization histogram with newly added data includes:
setting a time window for the variance optimization histogram built in memory; if the sampling time of newly arrived data in the in-memory sample space exceeds the time window, compressing the variance optimization histogram in that time window.
Further, compressing the variance optimization histogram in the time window includes:
retaining the boundary data of each bucket of the variance optimization histogram in the time window and deleting all data in each bucket other than the boundary data, wherein the boundary data of each bucket includes: the data corresponding to the minimum sampling time and the maximum sampling time in the bucket, and the average of all data in the bucket.
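The compression of a bucket to its boundary data can be sketched as below, again assuming samples are `(sampling_time, value)` pairs; the function name `compress_bucket` and the dictionary keys are hypothetical.

```python
def compress_bucket(samples):
    """Reduce a non-empty bucket to its boundary data.

    samples: list of (sampling_time, value) pairs in the bucket.
    Returns the earliest sample, the latest sample and the mean of all values.
    """
    first = min(samples, key=lambda s: s[0])  # data at the minimum sampling time
    last = max(samples, key=lambda s: s[0])   # data at the maximum sampling time
    mean = sum(v for _, v in samples) / len(samples)
    return {"first": first, "last": last, "mean": mean}


summary = compress_bucket([(3, 6.0), (1, 2.0), (2, 4.0)])
```

After compression each bucket of an expired time window occupies constant space regardless of how many samples it held.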
A variance optimization histogram construction device based on Spark Streaming, the improvement being that the device includes:
a sampling module for performing online sampling on streaming data using Spark Streaming;
a construction module for dynamically constructing a variance optimization histogram from the online sampled data;
an update module for dynamically updating the variance optimization histogram with newly added data, and dynamically reconstructing the variance optimization histogram from the newly added data.
Preferably, the sampling module is configured for:
setting the time interval of an RDD, using Spark Streaming to convert the streaming data into a DStream structure according to that time interval, calling the window() operation on the RDDs in the DStream structure to aggregate them into windows, and then performing online sampling on the data in each window;
wherein the time interval of a window is a positive integer multiple of the RDD time interval.
Further, performing online sampling on the data in each window includes:
setting the number of samples in the window's sample space to K, initializing the sampling threshold T and the variables sum and count to 0, and initializing a min-heap with a maximum of K elements;
storing the data in the window into the window's sample space, and updating the sampling threshold T according to the data in the window;
wherein, when the number of samples in the window's sample space reaches K+1, sample data ω_i and ω_j satisfying a constraint are selected from the window's sample space, the smaller of the two data values is accumulated onto the larger, and the smaller of ω_i and ω_j is deleted.
Further, the constraint includes: the sum of sample data ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the smallest in the window's sample space;
wherein the sampling cost J of sample data ω_i and ω_j is determined by the following formula:
In the above formula, ω_i is the i-th sample data, ω_j is the j-th sample data, and i<j;
Further, updating the sampling threshold T according to the data in the window includes:
judging whether a data value in the window is less than the sampling threshold T;
if so, adding the data value to the variable sum and incrementing the variable count by 1;
if not, storing the data in the min-heap;
wherein, when the number of elements in the min-heap reaches K, or the value at the top of the min-heap is less than the current sampling threshold T, the heap-top value is added to the variable sum, the variable count is incremented by 1, the heap-top data is deleted, and the sampling threshold T is updated to sum/count.
Preferably, the construction module is configured for:
setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance optimization histogram to B according to the time interval;
building an equi-width histogram with B buckets from the first K_max online-sampled data items to arrive in the in-memory sample space, wherein each online-sampled data item is added, according to its sampling time, to the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram.
Further, when the number of data items in the in-memory sample space reaches K_max+1, the newly added item is placed, according to its sampling time, into the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance together with its adjacent bucket of smaller variance are selected as the two buckets to be merged; if the total variance of the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, the split-and-merge operation is performed; otherwise, no operation is performed;
when the number of data items in the in-memory sample space reaches K_max+1 and the sampling time of the newly added item does not fall within the time range of any bucket of the variance optimization histogram, the item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance together with its adjacent bucket of smaller variance are selected as the two buckets to be merged; if the sum of the variances of the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, the split-and-merge operation is performed; otherwise, no operation is performed.
Preferably, the update module is configured for:
setting a time window for the variance optimization histogram built in memory; if the sampling time of newly arrived data in the in-memory sample space exceeds the time window, compressing the variance optimization histogram in that time window.
Further, compressing the variance optimization histogram in the time window includes:
retaining the boundary data of each bucket of the variance optimization histogram in the time window and deleting all data in each bucket other than the boundary data, wherein the boundary data of each bucket includes: the data corresponding to the minimum sampling time and the maximum sampling time in the bucket, and the average of all data in the bucket.
Compared with the closest prior art, the technical solution provided by the invention has the following beneficial effects:
The technical solution provided by the invention solves the problem that existing methods cannot efficiently build variance optimization histograms over massive, high-speed streaming data of unknown distribution, and supports high-precision real-time interval aggregate queries within a limited memory space. It combines the high throughput, low latency and fault tolerance of Spark Streaming, and the designed method is added to the Spark Streaming computing platform in the form of operators, which effectively solves the problem of building variance optimization histograms in a distributed environment. Through new RDD transformation operations under Spark Streaming, online variance-optimized sampling of the data in a set window is achieved in linear time. By dynamically building the variance optimization histogram in memory, the histogram can be built within a limited space without a complex dynamic programming algorithm. The technical solution realizes dynamic maintenance of variance optimization histograms in a high-speed, massive streaming data environment and supports real-time interval aggregate queries, while retaining the high fault tolerance and low latency of Spark Streaming.
Brief description of the drawings
Fig. 1 is a flow chart of the variance optimization histogram construction method based on Spark Streaming of the present invention;
Fig. 2 is an overall flow chart of the variance optimization histogram construction method based on Spark Streaming of the present invention;
Fig. 3 is a flow chart of online variance-optimized sampling of streaming data using Spark Streaming in an embodiment of the present invention;
Fig. 4 is a flow chart of the method for dynamically building the variance optimization histogram in memory in an embodiment of the present invention;
Fig. 5 is a structural diagram of the variance optimization histogram construction device based on Spark Streaming of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
To make the purpose, technical solution and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative work fall within the scope of protection of the invention.
With the explosive growth of network data, real-time analysis of streaming data has become a popular research field. A variance optimization histogram can return high-precision interval aggregation results within a limited space and is important and widely used in data statistics and analysis. However, because building a variance optimization histogram is complex, no current stream computing system provides an efficient online construction method. The present invention therefore devises a variance optimization histogram construction method based on Spark Streaming that efficiently builds variance optimization histograms over out-of-order streaming data of unknown distribution under limited memory. The method was compared with the traditional variance optimization histogram construction method and with existing approximate variance optimization histogram construction methods for streaming data; the experimental results show that it can build a high-precision approximate variance optimization histogram in a single pass over the data within a limited memory space, solving the problem that existing stream computing systems cannot efficiently build variance optimization histograms. The invention mainly comprises three parts: 1. in the Spark Streaming environment, performing online variance-optimized sampling on the data in the DStream; 2. persisting the sample data of each window to memory after sampling, and using the sample data to build an approximate variance optimization histogram through dynamic splitting and merging of histogram buckets; 3. dynamically maintaining the variance optimization histogram in memory and supporting real-time interval aggregate queries. As shown in Fig. 1, the method includes:
101. performing online sampling on streaming data using Spark Streaming;
102. dynamically constructing a variance optimization histogram from the online sampled data;
103. dynamically updating the variance optimization histogram with newly added data, and dynamically reconstructing the variance optimization histogram from the newly added data, that is, returning to step 102 to dynamically rebuild the variance optimization histogram from the newly added data.
Specifically, step 101 includes:
setting the time interval of an RDD, using Spark Streaming to convert the streaming data into a DStream structure according to that time interval, calling the window() operation on the RDDs in the DStream structure to aggregate them into windows, and then performing online sampling on the data in each window;
wherein the time interval of a window is a positive integer multiple of the RDD time interval.
Performing online sampling on the data in each window includes:
setting the number of samples in the window's sample space to K, initializing the sampling threshold T and the variables sum and count to 0, and initializing a min-heap with a maximum of K elements;
storing the data in the window into the window's sample space, and updating the sampling threshold T according to the data in the window;
wherein, when the number of samples in the window's sample space reaches K+1, sample data ω_i and ω_j satisfying a constraint are selected from the window's sample space, the smaller of the two data values is accumulated onto the larger, and the smaller of ω_i and ω_j is deleted.
The constraint includes: the sum of sample data ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the smallest in the window's sample space;
wherein the sampling cost J of sample data ω_i and ω_j is determined by the following formula:
In the above formula, ω_i is the i-th sample data, ω_j is the j-th sample data, and i<j;
Updating the sampling threshold T according to the data in the window includes:
judging whether a data value in the window is less than the sampling threshold T;
if so, adding the data value to the variable sum and incrementing the variable count by 1;
if not, storing the data in the min-heap;
wherein, when the number of elements in the min-heap reaches K, or the value at the top of the min-heap is less than the current sampling threshold T, the heap-top value is added to the variable sum, the variable count is incremented by 1, the heap-top data is deleted, and the sampling threshold T is updated to sum/count.
Step 102 includes:
setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance optimization histogram to B according to the time interval;
building an equi-width histogram with B buckets from the first K_max online-sampled data items to arrive in the in-memory sample space, wherein each online-sampled data item is added, according to its sampling time, to the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram.
Wherein, when the number of data items in the in-memory sample space reaches K_max+1, the newly added item is placed, according to its sampling time, into the bucket of the variance optimization histogram corresponding to the time interval in which it falls, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance together with its adjacent bucket of smaller variance are selected as the two buckets to be merged; if the total variance of the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, the split-and-merge operation is performed; if the sum of the variances of the data in the two buckets to be merged is not less than the variance of the data in the bucket to be split, no operation is performed;
when the number of data items in the in-memory sample space reaches K_max+1 and the sampling time of the newly added item does not fall within the time range of any bucket of the variance optimization histogram, the item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated; the bucket with the largest variance in the current variance optimization histogram is selected as the bucket to be split, and the bucket with the smallest variance together with its adjacent bucket of smaller variance are selected as the two buckets to be merged; if the sum of the variances of the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, the split-and-merge operation is performed; if the sum of the variances of the data in the two buckets to be merged is not less than the variance of the data in the bucket to be split, no operation is performed.
In step 103, dynamically updating the variance optimization histogram with newly added data includes:
setting a time window for the variance optimization histogram built in memory; if the sampling time of newly arrived data in the in-memory sample space exceeds the time window, compressing the variance optimization histogram in that time window.
Wherein, compressing the variance optimization histogram in the time window includes:
retaining the boundary data of each bucket of the variance optimization histogram in the time window and deleting all data in each bucket other than the boundary data, wherein the boundary data of each bucket includes: the data corresponding to the minimum sampling time and the maximum sampling time in the bucket, and the average of all data in the bucket.
A variance-optimized histogram construction device based on Spark Streaming, as shown in Fig. 5, the device including:
A sampling module, for performing online sampling of streaming data using Spark Streaming;
A construction module, for dynamically constructing a variance-optimized histogram from the online-sampled data;
An update module, for dynamically updating the variance-optimized histogram using newly added data, and for dynamically reconstructing the variance-optimized histogram according to the newly added data.
The sampling module includes:
Setting the RDD time interval, converting the streaming data into a DStream structure according to the time interval using Spark Streaming, calling the window() operation, with its parameters, on the RDDs in the DStream structure to aggregate the RDDs in the DStream structure into windows, and then performing online sampling on the data in each window;
Wherein, the time interval of a window is a positive integer multiple of the RDD time interval.
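The relationship between the RDD (batch) interval and the window interval can be illustrated without a running Spark cluster. The following plain-Python stand-in does not call Spark Streaming; the function name `group_into_windows` and the choice of non-overlapping windows are assumptions made for the example. In PySpark the corresponding operation is `DStream.window(windowDuration, slideDuration)`.

```python
def group_into_windows(batches, batch_interval, window_duration):
    """Aggregate consecutive batches into non-overlapping windows.

    batches: list of lists, one inner list per batch interval.
    batch_interval, window_duration: positive integers in the same time unit.
    The window duration must be a positive integer multiple of the batch interval.
    """
    if batch_interval <= 0 or window_duration <= 0:
        raise ValueError("intervals must be positive")
    if window_duration % batch_interval != 0:
        raise ValueError("window duration must be a multiple of the batch interval")
    batches_per_window = window_duration // batch_interval
    windows = []
    for start in range(0, len(batches), batches_per_window):
        group = batches[start:start + batches_per_window]
        # Flatten the batches of this window into a single list of records.
        windows.append([record for batch in group for record in batch])
    return windows
```

With a batch interval of 2 and a window duration of 4, the batches `[[1], [2, 3], [4], [5]]` become the windows `[[1, 2, 3], [4, 5]]`.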
Performing online sampling on the data in each window includes:
Setting the number of sample data items in the window's sample space to K, initializing the sampling threshold T, the variable sum and the variable count to 0, and initializing a min-heap structure with a maximum of K elements;
Storing the data in the window into the window's sample space, and updating the sampling threshold T according to the data in the window;
Wherein, when the number of sample data items in the window's sample space is K + 1, sample data ω_i and ω_j satisfying a constraint are selected from the sample data in the window's sample space, the smaller of ω_i and ω_j is accumulated onto the larger, and the smaller of ω_i and ω_j is deleted.
The constraint includes: the sum of sample data ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the minimum in the window's sample space;
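Wherein, the sampling cost J of sample data ω_i and ω_j is determined by the following formula:
$$J = \omega_i \cdot \omega_j \left( \frac{1}{3} \cdot (\omega_i + \omega_j) + \sum_{m=i+1}^{j-1} \omega_m \right)$$
In the above formula, ω_i is the i-th sample data item, ω_j is the j-th sample data item, and i < j;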
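The pair selection and merge can be sketched as below, using the sampling cost J = ω_i · ω_j · (1/3 · (ω_i + ω_j) + Σ_{m=i+1}^{j-1} ω_m) as given in the claims. The function names are hypothetical, and the exhaustive search over all pairs is an assumption chosen for clarity rather than efficiency.

```python
def sampling_cost(w, i, j):
    """Sampling cost J of merging samples w[i] and w[j], with i < j."""
    return w[i] * w[j] * ((w[i] + w[j]) / 3.0 + sum(w[i + 1:j]))


def pick_pair(w, threshold):
    """Return the pair (i, j) with w[i] + w[j] < threshold and minimal cost, or None."""
    best = None
    for i in range(len(w) - 1):
        for j in range(i + 1, len(w)):
            if w[i] + w[j] < threshold:
                cost = sampling_cost(w, i, j)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
    return None if best is None else (best[1], best[2])


def merge_pair(w, i, j):
    """Accumulate the smaller of w[i], w[j] onto the larger and delete the smaller."""
    small, large = (i, j) if w[i] <= w[j] else (j, i)
    merged = list(w)
    merged[large] += merged[small]
    del merged[small]
    return merged
```

For the samples `[3.0, 1.0, 2.0, 6.0]` and a threshold of 5.0, the admissible pairs are (0, 1) with cost 4.0 and (1, 2) with cost 2.0, so the pair (1, 2) is merged and the sample space becomes `[3.0, 3.0, 6.0]`.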
Updating the sampling threshold T according to the data in the window includes:
Determining whether a data value in the window is less than the sampling threshold T;
If so, adding the data value to the variable sum and incrementing the variable count by 1;
If not, storing the data item in the min-heap structure;
Wherein, when the number of data items in the min-heap structure reaches K, or the value of the heap-top data item of the min-heap structure is less than the current sampling threshold T, the value of the heap-top data item is added to the variable sum, the variable count is incremented by 1, the heap-top data item is deleted, and the sampling threshold T is updated to sum/count.
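A minimal sketch of the threshold update, using Python's `heapq` as the min-heap. The class name `ThresholdTracker` is hypothetical, and repeating the heap-top step in a loop until neither condition holds is an assumption; the text only describes a single step.

```python
import heapq


class ThresholdTracker:
    """Maintains the sampling threshold T = sum / count with a bounded min-heap."""

    def __init__(self, k):
        self.k = k              # maximum number of elements in the min-heap
        self.threshold = 0.0    # sampling threshold T
        self.total = 0.0        # the variable "sum" in the text
        self.count = 0
        self.heap = []

    def observe(self, value):
        """Process one data value and return the current threshold."""
        if value < self.threshold:
            self.total += value
            self.count += 1
        else:
            heapq.heappush(self.heap, value)
        # Move the heap top into sum/count while the heap is full
        # or its smallest element is below the current threshold.
        while self.heap and (len(self.heap) >= self.k or self.heap[0] < self.threshold):
            top = heapq.heappop(self.heap)
            self.total += top
            self.count += 1
            self.threshold = self.total / self.count
        return self.threshold
```

Note that, following the text, values below T are added to sum and count without recomputing T; the threshold only changes when a heap-top element is removed.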
The construction module includes:
Setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance-optimized histogram to B according to the time interval;
Building an equal-width histogram with B buckets from the first K_max online-collected data items to arrive in the in-memory sample space, wherein each online-collected data item is added, according to its collection time, to the bucket of the variance-optimized histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram.
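The initial equal-width construction can be sketched as follows. The `(timestamp, value)` sample layout and the function name are assumptions, as is the choice to derive the bucket width from the time span of the first K_max samples.

```python
from statistics import pvariance


def build_equal_width_histogram(samples, num_buckets):
    """Build an equal-width (in time) histogram and compute each bucket's variance.

    samples: non-empty list of (timestamp, value) tuples, the first K_max samples.
    Returns (buckets, variances), where buckets[i] holds the values of bucket i.
    """
    times = [t for t, _ in samples]
    t_min, t_max = min(times), max(times)
    # Fall back to a width of 1.0 when all samples share the same timestamp.
    width = (t_max - t_min) / num_buckets or 1.0
    buckets = [[] for _ in range(num_buckets)]
    for t, value in samples:
        # The latest timestamp would land one past the end, so clamp it.
        index = min(int((t - t_min) / width), num_buckets - 1)
        buckets[index].append(value)
    variances = [pvariance(b) if len(b) > 1 else 0.0 for b in buckets]
    return buckets, variances
```

With the samples `[(0, 1.0), (1, 3.0), (2, 10.0), (3, 10.0)]` and B = 2, the buckets are `[[1.0, 3.0], [10.0, 10.0]]` with variances `[1.0, 0.0]`.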
Wherein, when the number of data items in the in-memory sample space is K_max + 1, the newly added data item is added, according to its sampling time, to the bucket of the variance-optimized histogram that corresponds to the time interval in which it falls, and the variance of that bucket is updated. The bucket with the largest variance in the current variance-optimized histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged. If the variance of all the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-and-merge operation is performed; if the sum of the data variances in the two buckets to be merged is not less than the variance of the data in the bucket to be split, no operation is performed;
When the number of data items in the in-memory sample space is K_max + 1 and the sampling time of the newly added data item does not fall within the time range of any bucket of the variance-optimized histogram, the data item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated. The bucket with the largest variance in the current variance-optimized histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged. If the sum of the data variances in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-and-merge operation is performed; if the sum of the data variances in the two buckets to be merged is not less than the variance of the data in the bucket to be split, no operation is performed.
The update module includes:
Setting a time window for the variance-optimized histogram built in memory; if the sampling time corresponding to newly arrived data in the in-memory sample space exceeds the time window, the variance-optimized histogram within the time window is compressed.
Wherein, compressing the variance-optimized histogram within the time window includes:
Retaining the boundary data of each bucket of the variance-optimized histogram within the time window and deleting the data in each bucket other than the boundary data, wherein the boundary data of each bucket include: the data corresponding to the minimum sampling time and to the maximum sampling time in the bucket, and the average value of all data in the bucket.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims (10)

1. A variance-optimized histogram construction method based on Spark Streaming, characterized in that the method includes:
Performing online sampling of streaming data using Spark Streaming;
Dynamically constructing a variance-optimized histogram from the online-sampled data;
Dynamically updating the variance-optimized histogram using newly added data, and dynamically reconstructing the variance-optimized histogram according to the newly added data.
2. The method according to claim 1, characterized in that performing online sampling of streaming data using Spark Streaming includes:
Setting the RDD time interval, converting the streaming data into a DStream structure according to the time interval using Spark Streaming, calling the window() operation, with its parameters, on the RDDs in the DStream structure to aggregate the RDDs in the DStream structure into windows, and then performing online sampling on the data in each window;
Wherein, the time interval of a window is a positive integer multiple of the RDD time interval.
3. The method according to claim 2, characterized in that performing online sampling on the data in each window includes:
Setting the number of sample data items in the window's sample space to K, initializing the sampling threshold T, the variable sum and the variable count to 0, and initializing a min-heap structure with a maximum of K elements;
Storing the data in the window into the window's sample space, and updating the sampling threshold T according to the data in the window;
Wherein, when the number of sample data items in the window's sample space is K + 1, sample data ω_i and ω_j satisfying a constraint are selected from the sample data in the window's sample space, the smaller of ω_i and ω_j is accumulated onto the larger, and the smaller of ω_i and ω_j is deleted;
Preferably, the constraint includes: the sum of sample data ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the minimum in the window's sample space;
Wherein, the sampling cost J of sample data ω_i and ω_j is determined by the following formula:
$$J = \omega_i \cdot \omega_j \left( \frac{1}{3} \cdot (\omega_i + \omega_j) + \sum_{m=i+1}^{j-1} \omega_m \right)$$
In the above formula, ω_i is the i-th sample data item, ω_j is the j-th sample data item, and i < j;
Preferably, updating the sampling threshold T according to the data in the window includes:
Determining whether a data value in the window is less than the sampling threshold T;
If so, adding the data value to the variable sum and incrementing the variable count by 1;
If not, storing the data item in the min-heap structure;
Wherein, when the number of data items in the min-heap structure reaches K, or the value of the heap-top data item of the min-heap structure is less than the current sampling threshold T, the value of the heap-top data item is added to the variable sum, the variable count is incremented by 1, the heap-top data item is deleted, and the sampling threshold T is updated to sum/count.
4. The method according to claim 1, characterized in that dynamically constructing a variance-optimized histogram from the online-sampled data includes:
Setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance-optimized histogram to B according to the time interval;
Building an equal-width histogram with B buckets from the first K_max online-collected data items to arrive in the in-memory sample space, wherein each online-collected data item is added, according to its collection time, to the bucket of the variance-optimized histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram;
Preferably, when the number of data items in the in-memory sample space is K_max + 1, the newly added data item is added, according to its sampling time, to the bucket of the variance-optimized histogram that corresponds to the time interval in which it falls, and the variance of that bucket is updated; the bucket with the largest variance in the current variance-optimized histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged; if the variance of all the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-and-merge operation is performed; otherwise, no operation is performed;
When the number of data items in the in-memory sample space is K_max + 1 and the sampling time of the newly added data item does not fall within the time range of any bucket of the variance-optimized histogram, the data item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated; the bucket with the largest variance in the current variance-optimized histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged; if the sum of the data variances in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-and-merge operation is performed; otherwise, no operation is performed.
5. The method according to claim 1, characterized in that dynamically updating the variance-optimized histogram using newly added data includes:
Setting a time window for the variance-optimized histogram built in memory; if the sampling time corresponding to newly arrived data in the in-memory sample space exceeds the time window, the variance-optimized histogram within the time window is compressed;
Preferably, compressing the variance-optimized histogram within the time window includes:
Retaining the boundary data of each bucket of the variance-optimized histogram within the time window and deleting the data in each bucket other than the boundary data, wherein the boundary data of each bucket include: the data corresponding to the minimum sampling time and to the maximum sampling time in the bucket, and the average value of all data in the bucket.
6. A variance-optimized histogram construction device based on Spark Streaming, characterized in that the device includes:
A sampling module, for performing online sampling of streaming data using Spark Streaming;
A construction module, for dynamically constructing a variance-optimized histogram from the online-sampled data;
An update module, for dynamically updating the variance-optimized histogram using newly added data, and for dynamically reconstructing the variance-optimized histogram according to the newly added data.
7. The device according to claim 6, characterized in that the sampling module includes:
Setting the RDD time interval, converting the streaming data into a DStream structure according to the time interval using Spark Streaming, calling the window() operation, with its parameters, on the RDDs in the DStream structure to aggregate the RDDs in the DStream structure into windows, and then performing online sampling on the data in each window;
Wherein, the time interval of a window is a positive integer multiple of the RDD time interval.
8. The device according to claim 7, characterized in that performing online sampling on the data in each window includes:
Setting the number of sample data items in the window's sample space to K, initializing the sampling threshold T, the variable sum and the variable count to 0, and initializing a min-heap structure with a maximum of K elements;
Storing the data in the window into the window's sample space, and updating the sampling threshold T according to the data in the window;
Wherein, when the number of sample data items in the window's sample space is K + 1, sample data ω_i and ω_j satisfying a constraint are selected from the sample data in the window's sample space, the smaller of ω_i and ω_j is accumulated onto the larger, and the smaller of ω_i and ω_j is deleted;
Preferably, the constraint includes: the sum of sample data ω_i and ω_j is less than the sampling threshold T, and the sampling cost of ω_i and ω_j is the minimum in the window's sample space;
Wherein, the sampling cost J of sample data ω_i and ω_j is determined by the following formula:
$$J = \omega_i \cdot \omega_j \left( \frac{1}{3} \cdot (\omega_i + \omega_j) + \sum_{m=i+1}^{j-1} \omega_m \right)$$
In the above formula, ω_i is the i-th sample data item, ω_j is the j-th sample data item, and i < j;
Preferably, updating the sampling threshold T according to the data in the window includes:
Determining whether a data value in the window is less than the sampling threshold T;
If so, adding the data value to the variable sum and incrementing the variable count by 1;
If not, storing the data item in the min-heap structure;
Wherein, when the number of data items in the min-heap structure reaches K, or the value of the heap-top data item of the min-heap structure is less than the current sampling threshold T, the value of the heap-top data item is added to the variable sum, the variable count is incremented by 1, the heap-top data item is deleted, and the sampling threshold T is updated to sum/count.
9. The device according to claim 6, characterized in that the construction module includes:
Setting the maximum number of samples allowed in the in-memory sample space to K_max, and setting the number of buckets of the variance-optimized histogram to B according to the time interval;
Building an equal-width histogram with B buckets from the first K_max online-collected data items to arrive in the in-memory sample space, wherein each online-collected data item is added, according to its collection time, to the bucket of the variance-optimized histogram corresponding to the time interval in which it falls, and the variance of each bucket is determined from the data values in that bucket, thereby building the histogram;
Preferably, when the number of data items in the in-memory sample space is K_max + 1, the newly added data item is added, according to its sampling time, to the bucket of the variance-optimized histogram that corresponds to the time interval in which it falls, and the variance of that bucket is updated; the bucket with the largest variance in the current variance-optimized histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged; if the variance of all the data in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-and-merge operation is performed; otherwise, no operation is performed;
When the number of data items in the in-memory sample space is K_max + 1 and the sampling time of the newly added data item does not fall within the time range of any bucket of the variance-optimized histogram, the data item is added to the bucket whose time range is nearest to its sampling time, and the variance of that bucket is updated; the bucket with the largest variance in the current variance-optimized histogram is selected as the bucket to be split, and the bucket with the smallest variance, together with whichever of its adjacent buckets has the smaller variance, are selected as the two buckets to be merged; if the sum of the data variances in the two buckets to be merged is less than the variance of the data in the bucket to be split, a split-and-merge operation is performed; otherwise, no operation is performed.
10. The device according to claim 6, characterized in that the update module includes:
Setting a time window for the variance-optimized histogram built in memory; if the sampling time corresponding to newly arrived data in the in-memory sample space exceeds the time window, the variance-optimized histogram within the time window is compressed;
Preferably, compressing the variance-optimized histogram within the time window includes:
Retaining the boundary data of each bucket of the variance-optimized histogram within the time window and deleting the data in each bucket other than the boundary data, wherein the boundary data of each bucket include: the data corresponding to the minimum sampling time and to the maximum sampling time in the bucket, and the average value of all data in the bucket.
CN201710212747.7A 2017-04-01 2017-04-01 A kind of variance optimization histogram construction method and device based on Spark Streaming Pending CN107193862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710212747.7A CN107193862A (en) 2017-04-01 2017-04-01 A kind of variance optimization histogram construction method and device based on Spark Streaming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710212747.7A CN107193862A (en) 2017-04-01 2017-04-01 A kind of variance optimization histogram construction method and device based on Spark Streaming

Publications (1)

Publication Number Publication Date
CN107193862A true CN107193862A (en) 2017-09-22

Family

ID=59871263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710212747.7A Pending CN107193862A (en) 2017-04-01 2017-04-01 A kind of variance optimization histogram construction method and device based on Spark Streaming

Country Status (1)

Country Link
CN (1) CN107193862A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103638A (en) * 2010-02-01 2011-06-22 北京大学 Query log-based database statistic data histogram generation method
US20120102377A1 (en) * 2010-10-26 2012-04-26 Krishnamurthy Viswanathan Method for constructing a histogram
CN104657450A (en) * 2015-02-05 2015-05-27 中国科学院信息工程研究所 Big data environment-oriented summary information dynamic constructing and querying method and device
CN105718872A (en) * 2016-01-15 2016-06-29 武汉光庭科技有限公司 Auxiliary method and system for rapid positioning of two-side lanes and detection of deflection angle of vehicle

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
夏小玲 等: "面向数据流的差分隐私直方图发布", 《计算机与现代化》 *
杨颖 等: "动态地构建和维护基于小波的直方图", 《计算机应用研究》 *
林子雨老师: "Spark入门:Spark Streaming简介", 《HTTP://DBLAB.XMU.EDU.CN/BLOG/1076-2/》 *
黄超 等: "时间序列数据流直方图构造方法研究", 《统计与决策》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362600A (en) * 2019-07-22 2019-10-22 广西大学 A kind of random ordering data flow distribution aggregate query method, system and medium
CN110362600B (en) * 2019-07-22 2022-03-11 广西大学 Out-of-order data stream distributed aggregation query method, system and medium

Similar Documents

Publication Publication Date Title
CN105227488B (en) A kind of network flow group scheduling method for distributed computer platforms
CN101252541B (en) Method for establishing network flow classified model and corresponding system thereof
CN104376365A (en) Method for constructing information system running rule libraries on basis of association rule mining
CN106777093A (en) Skyline inquiry systems based on space time series data stream application
CN110222029A (en) A kind of big data multidimensional analysis computational efficiency method for improving and system
CN103745273A (en) Semiconductor fabrication process multi-performance prediction method
CN101000605A (en) Intelligent two-stage compression method for process industrial historical data
CN106708989A (en) Spatial time sequence data stream application-based Skyline query method
CN108416465A (en) A kind of Workflow optimization method under mobile cloud environment
CN107644252A (en) A kind of recurrent neural networks model compression method of more mechanism mixing
CN106161135A (en) Business transaction failure analysis methods and device
CN114513470A (en) Network flow control method, device, equipment and computer readable storage medium
CN116827350A (en) Flexible work platform intelligent supervision method and system based on cloud edge cooperation
CN116050540A (en) Self-adaptive federal edge learning method based on joint bi-dimensional user scheduling
CN114781650A (en) Data processing method, device, equipment and storage medium
CN110263917A (en) A kind of neural network compression method and device
CN107193862A (en) A kind of variance optimization histogram construction method and device based on Spark Streaming
CN114169506A (en) Deep learning edge computing system framework based on industrial Internet of things platform
CN110309955A (en) A kind of non-load predicting method and device shut down when upgrading of cloud environment application system
CN114282658B (en) Method, device and medium for analyzing and predicting flow sequence
CN106909459A (en) A kind of method and device for adjusting connection pool
CN110191005A (en) A kind of alarm log processing method and system
CN109800271A (en) A kind of information collecting method based on big data
CN107704565A (en) Equivalent body generation method, device and system
CN110989040A (en) Artificial intelligent lightning approach early warning method and system based on slice processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240119)