CN104935348B - Summary data compression method with controllable estimation error - Google Patents

Summary data compression method with controllable estimation error

Info

Publication number
CN104935348B
CN104935348B (application number CN201510254377.4A)
Authority
CN
China
Prior art keywords
time
data
sample
summary data
tracker
Prior art date
Legal status
Active
Application number
CN201510254377.4A
Other languages
Chinese (zh)
Other versions
CN104935348A (en)
Inventor
Wu Guangjun (吴广君)
Yun Xiaochun (云晓春)
Wang Shupeng (王树鹏)
Current Assignee
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201510254377.4A
Publication of CN104935348A
Application granted
Publication of CN104935348B

Landscapes

  • Information Retrieval; Database Structures and File-System Structures Therefor (AREA)

Abstract

The invention discloses a summary data compression method with controllable estimation error. The method is as follows. 1) A time tracker is established for the summary data of each object; for summary data to be written, the tracker of the corresponding object is located, and the tracker samples the summary data and stores the samples in its sample set. 2) The samples in each tracker's sample set are partitioned into multiple time phases and error parameters are set for the phases; each tracker then resamples its samples according to the corresponding error parameters. 3) The processed sample sets are merged into one sample set H; the samples in H are partitioned into multiple time phases and written, after resampling with the corresponding error parameters, into the sample set of a new time tracker. After compression by the present invention, the summary data not only gains storage space linearly, but still supports error-bounded approximate computation.

Description

Summary data compression method with controllable estimation error
Technical field
The invention belongs to the field of information technology. Against the background of the ever-growing summary data on which approximate-query systems rely in big-data environments, it proposes a summary data compression method with controllable error.
Background art
Big-data analysis and processing technologies are now widely used across industries: by analyzing massive in-industry data resources, they deliver timely and reliable decisions to upper-layer services. Approximate computation is an important technique in big-data analysis systems. Because it needs far less summary data than the raw data while still providing high-precision approximate results, it has been widely adopted in applications that tolerate a certain error, such as microblog statistics for large microblogging sites, clickstream statistics for shopping sites, and transaction-log stream statistics; in these systems approximate computation not only copes with massive data scales but also provides highly real-time decision support for upper-layer services. Approximate computation has further been applied to real-time sentiment analysis of networks (H. Wang, D. Can, A. Kazemzadeh, F. Bar, and S. Narayanan, "A system for real-time twitter sentiment analysis of 2012 u.s. presidential election cycle," in Proceedings of the ACL 2012 System Demonstrations, ser. ACL '12. Stroudsburg, PA, USA: Association for Computational Linguistics, 2012, pp. 115-120), prediction of economic indices (T. Preis, H. S. Moat, and E. H. Stanley, "Quantifying trading behavior in financial markets using Google Trends," Sci. Rep., vol. 3, p. 1684, 2013), and real-time intrusion detection (X. Yun, Y. Wang, Y. Zhang, and Y. Zhou, "A semantics-aware approach to the automated network protocol identification," Networking, IEEE/ACM Transactions on, vol. PP, no. 99, pp. 1-1, 2015).
Approximate-query systems in big-data environments, however, face constantly expanding summary data: as the data scale grows rapidly, so does the scale of the summary data on which approximate queries rely. This creates a conflict between estimation precision and summary-data volume: the higher the approximate estimation precision, the larger the summary data that must be stored. Recently proposed big-data approximate-computation techniques, such as approximate top-k computation (J. Jestes, J. M. Phillips, F. Li, and M. Tang, "Ranking large temporal data," Proc. VLDB Endow., vol. 5, no. 11, pp. 1412-1423, Jul. 2012), approximate range-sum computation (X. Yun, G. Wu, G. Zhang, K. Li, and S. Wang, "Fastraq: A fast approach to range-aggregate queries in big data environments," Cloud Computing, IEEE Transactions on, vol. PP, no. 99, pp. 1-1, 2014), ordered-set sampling (E. Cohen, G. Cormode, and N. Duffield, "Structure-aware sampling: Flexible and accurate summarization," Proceedings of the VLDB Endowment, vol. 4, no. 11, 2011), and sliding-window techniques (M. Datar, A. Gionis, P. Indyk, and R. Motwani, "Maintaining stream statistics over sliding windows: (extended abstract)," in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA '02, 2002, pp. 635-644), all ignore the capacity of the summary data. When old and new data share a single error parameter, obtaining high-precision estimates requires setting a low error parameter and therefore maintaining larger summary data; yet for long-lived and rarely used summary data, keeping such large summaries clearly wastes space. Another line of solutions stores summary data on high-speed media such as SSDs, enlarging capacity and improving access efficiency, but this is costly and still leaves the conflict between the estimation precision of different summaries and the summary-data volume in big-data environments unresolved.
Summary of the invention
In view of the technical problems of the prior art, the object of the present invention is to provide a summary data compression method with controllable estimation error. Based on the freshness sensitivity of big data, the invention proposes an error-bounded summary data compression method. The freshness sensitivity of big data can be described as follows: any object in big data arrives at high speed at some point in time, then begins to spread among related topics, gradually decays after days or weeks, and finally dies out. Combining this characteristic of big data, the invention compresses long-lived and rarely used objects along the two dimensions of key and time; the compressed summary data not only gains storage space linearly, but still supports error-bounded approximate computation.
Big data generally carries the two attribute dimensions of key and time, and the invention correspondingly proposes two summary data compression methods, one based on the key dimension and one based on the time dimension. Compression starts from the summary data that is rarely used and has existed for a long time; the compressed summary data preserves the computation logic of the original summary data and yields a linear gain in storage space. Further, the invention details the compression method and process of summary data on the basis of deterministic wave sampling.
The invention compresses the summary data structure of data sources whose items have the form <key, time, value>; the key points are summarized as follows:
1) A compression method for summary data in the key dimension is proposed. Compression in the key dimension compresses the summary data corresponding to multiple different keys in a set into the summary data of a single structure. The compressed summary data can provide, for any key in the set, an estimate with the set's average error; this realizes set-oriented summary data compression.
2) A compression method for summary data in the time dimension is proposed. Compression in the time dimension exploits the time sensitivity of big data: new data is given a lower relative error and old data a larger relative error, so that new data enjoys higher computation precision while long-lived summary data has lower estimation precision, with automatic transition between old and new data according to the configured error parameters.
3) A concrete application of the above compression is given on the basis of deterministic wave sampling, describing the summary data structure, the specific compression processes in the key and time dimensions, and the maintenance of the summary data, effectively resolving the conflict between estimation precision and data volume of summary data in real streaming big-data environments.
The technical solution of the invention is as follows:
A summary data compression method with controllable estimation error, whose steps are:
1) Establish a time tracker for the summary data of each object. For summary data to be written, locate, according to the summary data, the tracker of the corresponding object; the tracker then samples the summary data with an error-bounded sampling method and stores the samples in its sample set.
2) Partition the samples in the sample set of each tracker tracker_i into multiple time phases along the time dimension and set an error parameter for each phase; tracker_i then resamples the samples of the (i-1)-th time phase according to the error parameter ξ_i corresponding to the i-th time phase.
3) Merge the sample sets processed in step 2) into one sample set H, then write the samples in H, in time order, into the sample set of a new tracker tracker_new. During writing, tracker_new partitions the samples of H into multiple time phases along the time dimension and sets an error parameter for each phase; it then resamples the samples of each phase according to the phase's error parameter.
Further, the error parameter of each time phase is set as follows: if the error of the j-th time phase phase_{i,j} in the sample set of the i-th tracker tracker_i is ξ_{i,j}, then ξ_{i,j} = r^h · ξ, where r is the compression parameter with r > 1, ξ is the error parameter of the first time phase, and h = (Tsmax - StartTs)/TL, with TL the length of phase_{i,j}'s time interval, Tsmax the maximum timestamp of tracker_i, and StartTs the start time of phase_{i,j}.
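A minimal sketch of this per-phase error computation (Python; names are illustrative, and phase boundaries are assumed to be aligned to multiples of TL so that h comes out integral):

```python
def phase_error(r: float, xi: float, ts_max: float, start_ts: float, tl: float) -> float:
    """xi_{i,j} = r**h * xi, with h = (Tsmax - StartTs) / TL.

    r        -- compression parameter, r > 1
    xi       -- error parameter of the first (newest) time phase
    ts_max   -- maximum timestamp seen by this tracker (Tsmax)
    start_ts -- start time of the phase being considered (StartTs)
    tl       -- length TL of each phase's time interval
    """
    h = (ts_max - start_ts) / tl  # distance of the phase from the newest data
    return (r ** h) * xi

# Example: r = 2, xi = 0.05, phase starting 4 intervals before Tsmax:
# phase_error(2.0, 0.05, ts_max=1000.0, start_ts=600.0, tl=100.0) -> 2**4 * 0.05 = 0.8
```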
Further, the error parameter of each time phase is identical.
Further, the structure of the sample data in tracker_i's sample set is <N, value, Ts>, where N is the aggregate value of all data written so far, i.e. N = Σ_i value_i, with value_i the value field of the i-th summary datum; value is the numeric field of the summary data used for statistics; and Ts marks the time at which the data was produced.
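A sketch of this record in Python (field and function names are illustrative; the running aggregate N is carried forward on append):

```python
from dataclasses import dataclass

@dataclass
class SampleRecord:
    n: float      # N: aggregate of all values written so far, N = sum_i value_i
    value: float  # value: the numeric field of this data item used for statistics
    ts: float     # Ts: the time at which the data item was produced

def append_record(samples: list, value: float, ts: float) -> None:
    """Append a new record, maintaining the running aggregate N."""
    n_prev = samples[-1].n if samples else 0.0
    samples.append(SampleRecord(n=n_prev + value, value=value, ts=ts))
```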
Further, the number of samples maintained in each level is bounded by a value m' determined by the error parameter; samples whose position exceeds m' are discarded directly.
Further, tracker_i samples according to the N values using the deterministic wave sampling method, and places the obtained sample data, in levels and in increasing timestamp order, into tracker_i's sample set.
Further, the sampling method is one of: deterministic wave sampling, random wave sampling, random sampling, or exponential-histogram sampling.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1) An error-bounded summary data compression method is proposed according to the freshness sensitivity of big data. In big-data environments, maintaining high-precision approximate computation usually requires storing and managing summary data of enormous scale. Exploiting the freshness sensitivity of big-data applications, the invention starts from long-lived and rarely used summary data and applies error-bounded compression in the two dimensions of key and time. Compared with traditional approximate-computation systems and summary-maintenance methods, it treats sample data differentially while respecting precision requirements, effectively supporting high-precision approximate computation while markedly reducing the overhead of storing and maintaining summary data.
2) Taking the concrete FS-Sketch as an example, a specific summary data compression and maintenance method is given. The compressed summary data retains the application characteristics of the original summary data, while the error parameter of the approximate results remains controllable. Compared with existing methods, it is better suited to applications over streaming big data with freshness-sensitive characteristics, effectively supporting approximate computations such as point queries and interval queries.
Brief description of the drawings
Fig. 1 is a schematic diagram of the compound summary organizational structure supporting compression;
Fig. 2 is the time-based summary data compression workflow;
Fig. 3 is the relation between summary-data volume and relative error during compression;
Fig. 4 compares the time consumption of summary data compression under different error settings.
Embodiment
To make the purpose, technical solution, and advantages of the present invention clearer, the summary data compression method according to one embodiment of the invention is further described below with reference to the accompanying drawings.
Big-data applications generally combine high data throughput with massive data scale. In the present invention the basic format of an object's data item is Object: <key, value, Ts>, where key is the unique identifier of the data item, value is the numeric attribute used for statistics, and Ts marks the time at which the data was produced, i.e. the object's time attribute.
The summary data structure is the core data structure of an approximate-computation system. It is an online data structure built and maintained in real time from the data sources loaded into the big-data system, and it supports query patterns such as point queries (e.g. sum, count) and interval queries. The invention first describes a concrete summary data structure (FS-Sketch) based on deterministic wave sampling, and then introduces the error-bounded summary compression method on this structure. Although the compression method is presented on a specific summary structure as a case study, it can be generalized to other, more general summary data structures.
1. Compound summary data structure supporting compression and its maintenance
The invention first provides a compound summary data structure that can effectively manage the two dimensions of key and time. Specifically, the compound structure establishes a time tracker for the summary data of each object; the tracker records, by sampling, the object's value attribute at different points in time. A typical two-dimensional compound summary data structure is shown in Fig. 1.
On a write, the key of the data item is hashed to locate the specific tracker, and the data is then sampled inside the tracker by the error-bounded sampling method. On a query, the same hash algorithm locates the tracker, estimation is performed over the tracker's sample set, and the approximate result is returned. A compound summary structure with these characteristics can effectively manage data in both the time and key dimensions, and constitutes an open, basic summary data structure: different sampling methods and hash algorithms can be plugged into it to support the corresponding approximate-computation operations. The invention introduces the summary compression method on this basic structure.
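A minimal sketch of this write/query routing (Python; the Tracker stub, hash choice, and bucket count are illustrative assumptions, and the stub omits the error-bounded sampling):

```python
import hashlib

class Tracker:
    """Stand-in for a per-object time tracker; a real tracker samples on insert."""
    def __init__(self):
        self.samples = []   # (N, value, Ts) records, in increasing Ts order
        self.total = 0.0
    def insert(self, value: float, ts: float) -> None:
        self.total += value
        self.samples.append((self.total, value, ts))  # would be sampled, not kept whole
    def estimate(self) -> float:
        return self.samples[-1][0] if self.samples else 0.0  # latest aggregate N

class CompoundSketch:
    """Routes each data item <key, value, Ts> to its tracker via a hash of key."""
    def __init__(self, num_buckets: int = 1024):
        self.buckets = [Tracker() for _ in range(num_buckets)]
    def _locate(self, key: str) -> Tracker:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)  # any stable hash works
        return self.buckets[h % len(self.buckets)]
    def write(self, key: str, value: float, ts: float) -> None:
        self._locate(key).insert(value, ts)
    def query(self, key: str) -> float:
        return self._locate(key).estimate()  # same hash on the query path
```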
2. Time-dimension summary data compression
Time-dimension compression further divides the sample set of each tracker in Fig. 1, along the time dimension, into multiple time phases (phases), each covering a time interval of a fixed length TL. Different time phases can be given different error parameters. For freshness-sensitive applications, newly arrived data is generally of high value, while long-lived data, after a certain decay period, is generally of low application value. Newly arrived data is written into the first time phase and sampled with error parameter ξ; when the time span covered by the samples of the first phase reaches the threshold TL, a new phase is created with the same error parameter ξ to receive the newly written data. Without compression, the system would have to maintain all sample data at error parameter ξ. The core idea of time-dimension compression is that, according to the configured error parameters, a certain amount of sample data is deleted from the sample sets of old time phases, improving storage efficiency while keeping the estimation error bounded.
Let the compression parameter of the summary data be r, with r > 1, and let the i-th time phase be phase_i with time range [StartTs_i, EndTs_i]. For a given phase_i, a certain amount of sample data is removed so that its error parameter becomes r^{h-1}·ξ, where h is the distance of the phase from the first time phase and ξ is the error parameter of the first time phase. The compression process with compression parameter r is shown in Fig. 2.
Since data writes always land in the first time phase, summary compression can run concurrently with writes, so the method effectively improves the write efficiency of streaming big data. A sketch of the phase-by-phase compression follows.
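A minimal sketch of the time-dimension compression (Python; resample_to is a naive uniform thinning used only as a placeholder for the error-bounded resampling of whatever sampling method is plugged in, and all names are illustrative):

```python
def resample_to(phase: list, base_error: float, target_error: float) -> None:
    """Placeholder: thin the phase roughly in proportion to base_error / target_error.
    A real implementation would resample with the error-bounded method in use
    (e.g. deterministic wave sampling)."""
    step = max(1, round(target_error / base_error))
    phase[:] = phase[::step]  # keep every step-th sample

def compress_time_dimension(phases: list, r: float, xi: float) -> None:
    """phases[0] is the newest phase (distance h = 1, error r**0 * xi = xi);
    the phase at distance h is thinned until its error parameter is r**(h-1) * xi."""
    for idx, phase in enumerate(phases):
        h = idx + 1
        if h == 1:
            continue  # the newest phase keeps the base error xi and receives writes
        resample_to(phase, xi, (r ** (h - 1)) * xi)
```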
3. Key-dimension summary data compression
Key-dimension compression groups multiple trackers to be compressed into a set and then writes the sample data recorded in each tracker of the set, in time order, into a newly established tracker. The write follows the summary-update principle, i.e. the data is resampled with a given error parameter; all newly written samples can further be organized, in time order, into multiple time phases, so as to support the time-dimension compression of step (2).
Let the keys to be merged be key_1 ~ key_n, with corresponding tracker set {tracker_i | 1 ≤ i ≤ n}, and let the newly created tracker be tracker_new with error parameter ξ_new. The sample data recorded in {tracker_i | 1 ≤ i ≤ n} is written, in time order, into tracker_new; during the write the data follows the compound summary data structure given in section (1).
This yields the key-based compressed summary data structure. We now give the estimation-error bound of the compressed structure. If every tracker in the set {tracker_i | 1 ≤ i ≤ n} to be merged has the same error ξ, and the error of the newly created tracker is ξ_new, then the error of the compressed summary data is ξ + ξ_new + ξ·ξ_new, which is approximately equal to ξ + ξ_new.
When ξ = ξ_new, the compressed summary data provides, for the aggregate value of the merged {tracker_i | 1 ≤ i ≤ n}, approximate estimates with twice the original error.
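The bound follows from the standard composition of multiplicative relative errors; written out (this derivation is supplied here for clarity and is consistent with the figures above):

```latex
\begin{aligned}
\widehat{V} &\in \bigl[(1-\xi)(1-\xi_{\mathrm{new}})\,V,\;(1+\xi)(1+\xi_{\mathrm{new}})\,V\bigr],\\[2pt]
(1+\xi)(1+\xi_{\mathrm{new}}) - 1 &= \xi + \xi_{\mathrm{new}} + \xi\,\xi_{\mathrm{new}}
\;\approx\; \xi + \xi_{\mathrm{new}}
\qquad (\xi\,\xi_{\mathrm{new}} \text{ is second order}),\\[2pt]
\xi = \xi_{\mathrm{new}} &\;\Longrightarrow\; \text{combined error} \approx 2\xi .
\end{aligned}
```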
Specific embodiment
The above gives the error-bounded summary data compression process. The method was presented on an open compound summary structure, and the method and its conclusions apply to several summary data structures in common use today, such as those based on deterministic wave sampling, random wave sampling, random sampling, and exponential histograms. To highlight the usability of the method, an embodiment is given below based on the concrete deterministic wave sampling algorithm.
(1) Building the compound summary data structure
It is the method for sampling of generally use in flow data sliding window to determine ripple Sampling techniques.Needed in each tracker Record data item<N,value,Ts>, wherein N is the value of current all write-in data polymerizing value, i.e., Ts is the time caused by flag data.Determine that ripple is sampled according to N, the sample data of acquisition is that layering is placed.If sample is put The number of plies level put is ln, and the data newly write are subsequently placed in ln layers so that timestamp is incremental.
(2) Time-dimension compression of the summary data
In deterministic wave sampling, each level maintains a bounded number of samples; when the sample data exceeds this bound, the samples with the oldest timestamps in that level are discarded. Following the time-attribute-based compression method, a new error parameter ξ' determines the maximum sample count m' of the current level; samples whose position exceeds m' are discarded directly, realizing error-bounded sample compression. As the computation shows, when the sample volume is large this process yields a linear gain in data storage space.
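A sketch of this per-level truncation (Python). The per-level capacity formula is not legible in the source text, so the usual deterministic-wave-style bound m(ξ) = ⌈1/(2ξ)⌉ + 1 is assumed here purely as an illustration and should be replaced by the patent's actual formula:

```python
import math

def level_capacity(eps: float) -> int:
    """Assumed per-level sample bound; substitute the actual formula from the patent."""
    return math.ceil(1.0 / (2.0 * eps)) + 1

def compress_level(level: list, eps_new: float) -> None:
    """Shrink a level to the capacity m' implied by the coarser error eps_new.

    Samples are stored in increasing-timestamp order, so discarding the
    samples beyond position m' means dropping the oldest ones first."""
    m_new = level_capacity(eps_new)
    if len(level) > m_new:
        del level[: len(level) - m_new]  # keep only the newest m_new samples
```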
(3) Key-dimension compression of the summary data
For key-based compression, trackers that have long been inactive and have similar aggregate values can be selected to build the set to be compressed, {tracker_i | 1 ≤ i ≤ n}. The data in all these trackers is then replayed, ordered by the sample time attribute Ts, into a newly built tracker_new. tracker_new supports aggregate estimation over key_1 ~ key_n, with estimation error matching the analysis above.
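A sketch of this time-ordered replay (Python; each tracker's sample list is assumed to be sorted by Ts, sample tuples are (N, value, Ts) as above, and tracker_new.insert is assumed to resample with its own error parameter ξ_new):

```python
import heapq

def merge_trackers(trackers: list, tracker_new) -> None:
    """k-way merge of the per-tracker sample streams into tracker_new, ordered by Ts."""
    streams = (t.samples for t in trackers)              # each already sorted by Ts
    for n, value, ts in heapq.merge(*streams, key=lambda rec: rec[2]):
        tracker_new.insert(value, ts)                    # resampled at xi_new on insert
```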
Experimental data and conclusions
An experiment was designed according to the content of the invention as follows: the page views from the web-site access logs published by Wikipedia were used as the test data set; the experiment selected 8 days of web logs, nearly 90 GB of raw data. The experiment analyzes the time overhead of summary data compression and the storage-space gain after compression.
Fig. 3 gives the relation between summary-data volume and relative error. Under a test environment in which the 90 GB of test data produces 1 GB of summary data, summary-data space is expressed as the number of samples. During compression, as the relative error grows from 0.05 to 0.45, the number of samples falls from 14 M to 1.6 M, and relative error and sample volume exhibit an approximately linear relation. Fig. 4 further gives the time overhead of this process: summary data compression takes about 150 ms on average. The proposed summary data compression method can therefore run efficiently in approximate-computation systems in big-data environments, guaranteeing the computation error while improving storage-space utilization linearly.
The above is only an overview of the technical solution of the present invention and does not limit it. Those of ordinary skill in the art to which the invention belongs may make minor changes and modifications without departing from the spirit and scope of the invention, and all such changes fall within the protection scope of the invention. The protection scope of the present invention is therefore defined by the claims.

Claims (7)

1. A summary data compression method with controllable estimation error, whose steps are:
1) establishing a time tracker for the summary data of each object; for summary data to be written, locating, according to the summary data, the tracker of the corresponding object, the tracker then sampling the summary data with an error-bounded sampling method and storing the samples in its sample set;
2) partitioning the samples in the sample set of each tracker tracker_i into multiple time phases along the time dimension and setting an error parameter for each phase; tracker_i then resampling the samples of the (i-1)-th time phase according to the error parameter ξ_i corresponding to the i-th time phase;
3) merging the sample sets processed in step 2) into one sample set H, then writing the samples in H, in time order, into the sample set of a new tracker tracker_new; wherein, during writing, tracker_new partitions the samples of H into multiple time phases along the time dimension, sets an error parameter for each phase, and then resamples the samples of each phase according to the phase's error parameter.
2. The method of claim 1, wherein the error parameter of each time phase is set as follows: if the error of the j-th time phase phase_{i,j} in the sample set of the i-th tracker tracker_i is ξ_{i,j}, then ξ_{i,j} = r^h · ξ, where r is the compression parameter with r > 1, ξ is the error parameter of the first time phase, and h = (Tsmax - StartTs)/TL, with TL the length of phase_{i,j}'s time interval, Tsmax the maximum timestamp of tracker_i, and StartTs the start time of phase_{i,j}.
3. The method of claim 1, wherein the error parameter of each time phase is identical.
4. The method of claim 1, 2 or 3, wherein the structure of the sample data in tracker_i's sample set is <N, value, Ts>, where N is the aggregate value of all data written so far, i.e. N = Σ_i value_i, with value_i the value field of the i-th summary datum; value is the numeric field of the summary data used for statistics; and Ts marks the time at which the data was produced.
5. The method of claim 4, wherein the number of samples maintained in each level is bounded by a value m' determined by the error parameter, and samples whose position exceeds m' are discarded directly.
6. The method of claim 5, wherein tracker_i samples according to the N values using the deterministic wave sampling method, and places the obtained sample data, in levels and in increasing timestamp order, into tracker_i's sample set.
7. The method of claim 1, 2 or 3, wherein the sampling method is: deterministic wave sampling, random wave sampling, random sampling, or exponential-histogram sampling.
CN201510254377.4A 2015-05-18 2015-05-18 Summary data compression method with controllable estimation error Active CN104935348B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201510254377.4A | 2015-05-18 | 2015-05-18 | CN104935348B (en) Summary data compression method with controllable estimation error

Publications (2)

Publication Number Publication Date
CN104935348A (en) | 2015-09-23
CN104935348B (en) | 2018-01-05

Family

ID=54122344


Patent Citations (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
TWI241502B * | 2002-12-26 | 2005-10-11 | Ind Tech Res Inst | Real-time data compression apparatus for a data recorder
CN101499097A * | 2009-03-16 | 2009-08-05 | Zhejiang Gongshang University | Hash-table-based method for in-memory compression and storage of frequent patterns in data streams

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party

Title
Amrit Pal et al., "A Time Based Analysis of Data Processing on …," 2014 Sixth International Conference on Computational Intelligence and Communication Networks, 2015-03-26, pp. 608-612. *
Chong Zhihong et al., "Efficient Computation of k-Medians over Data Streams Under Memory …," Journal of Computer Science and Technology, vol. 21, no. 2, 2006-03-31, pp. 284-296. *
Li Lei et al., "A data-stream … based on a variable sliding window," Science Technology and Engineering, vol. 14, no. 9, 2014-03-31, pp. 221-226. *
Liu Chang, "Research and implementation of a synopsis-based data stream management system," China Masters' Theses Full-text Database, Information Science and Technology, no. 5, 2012-05-15, abstract and pp. 1-53. *

Also Published As

Publication number Publication date
CN104935348A (en) 2015-09-23


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant