CN109871362A - A data compression method for streaming time series data - Google Patents

A data compression method for streaming time series data

Info

Publication number
CN109871362A
Authority
CN
China
Prior art keywords
data
compression
value
coding
observation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910112962.9A
Other languages
Chinese (zh)
Inventor
李建欣
李晨
司靖辉
韦冠宇
胡春明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201910112962.9A
Publication of CN109871362A
Legal status: Pending

Abstract

The present invention proposes a data compression method for streaming time series data, comprising the following steps. Step 1, data cleaning: the cleaning includes missing-value handling and outlier handling, after which type identification is performed on the data to obtain timestamp data and observation data. Step 2, data compression: the encapsulated timestamp data undergoes timestamp compression, and the observation data undergoes observation data compression. Step 3: the compressed timestamp data and the compressed observation data are variable-length encoded. Step 4, data encapsulation: the data are compressed by column according to data type and stored in a data file.

Description

A data compression method for streaming time series data
Technical field
The present invention relates to the field of data compression, and in particular to a data compression method for streaming time series data.
Background art
Time series data is a very common kind of data with distinct regularities. It represents a series of observations at points in time, that is, a data set obtained by sampling at constant time intervals, such as real-time PM2.5 concentration measurements, real-time stock prices, or temperature sensor readings in a production workshop. In other words, time series data describes the measured values of a measured subject at each point within a certain time range. In recent years, with the development of the Internet of Things, big data, and artificial intelligence, the scale of time series data has grown explosively. Taking Baidu's unmanned vehicles as an example, the data collected by each car can reach 8 TB per day. Enterprises therefore have to keep adding large numbers of storage devices to cope with the growing storage demand. However, larger storage capacity consumes more system overhead and energy, bringing a heavy cost burden. How to compress and encapsulate data effectively according to the characteristics of time series data has therefore become a problem and challenge that cannot be ignored in the development of current time series databases.
From a modeling point of view, time series data consists of three main parts: the measured subject, the timestamp, and the measured value. Because the information about the measured subject does not change in the short term, compression schemes for time series data mainly target the timestamp data and the corresponding measured values. Timestamp data usually serves as the primary key for storage. Its acquisition period is fixed in most cases, and although the intervals at which data arrive are often affected by the transmission medium, they still fluctuate around a constant value. Observation data is collected continuously over a time interval, so consecutive values are more closely correlated than discrete data, which means that their variation is usually small. Prior-art data compression methods can be roughly divided into lossless and lossy compression. Among lossless methods, the common zlib and LZ-family algorithms compress the raw data directly. They are relatively simple to implement, but because they do not exploit the characteristics of time series data, their compression ratio on such data is low and cannot meet the corresponding storage requirements. Among lossy methods, the common swinging door algorithm compresses data by setting "dead band" and "dead time" attributes. Although its compression ratio is higher, in some cases it loses precision relative to the original information, so it is also unsuitable for compressing time series data that requires high precision.
In addition, when the prior art implements persistent storage of time series data, data compression alone is not enough to guarantee requirements such as the consistency of the stored data. The compression parameters must also be stored appropriately so that normal data retrieval and reading remain possible. Time series processing mainly deals with high-frequency data at second or millisecond granularity, and such data often produces very large volumes in a short time. Storage of time series data therefore needs to make maximum use of disk space while meeting the required precision, and prior-art compression algorithms perform unsatisfactorily in these respects.
Summary of the invention
In view of the above problems, the invention proposes a data encapsulation method for streaming time series data, which applies two different compression schemes to timestamp data and observation data respectively, together with a corresponding data block encapsulation format. To achieve this object, the present invention comprises the following steps. Step 1, data cleaning: the cleaning includes missing-value handling and outlier handling, after which type identification is performed on the data to obtain timestamp data and observation data. Step 2, data compression: the encapsulated timestamp data undergoes timestamp compression, and the observation data undergoes observation data compression. Step 3: the compressed timestamp data and the compressed observation data are variable-length encoded. Step 4, data encapsulation: the data are compressed by column according to data type and stored in a data file. The invention has the following advantages: a high compression ratio, fast compression and decompression, high data precision, guaranteed data consistency, and retrieval and reading of local data without decompressing all data blocks.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 shows the variable-length coding scheme applied to the stored deviations;
Fig. 3 shows the variable-length coding used for observation data compression;
Fig. 4 is a diagram of the data file encapsulation structure.
Specific embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
The present invention comprises the following steps. Step 1, data cleaning: the cleaning includes missing-value handling and outlier handling, after which type identification is performed on the data to obtain timestamp data and observation data. Step 2, data compression: the encapsulated timestamp data undergoes timestamp compression, and the observation data undergoes observation data compression. Step 3: the compressed timestamp data and the compressed observation data are variable-length encoded. Step 4, data encapsulation: the data are compressed by column according to data type and stored in a data file.
The overall compression flow of the invention is shown in Fig. 1. In this flow, the streaming data is first preprocessed, including missing-value handling, outlier handling, and order adjustment, so that it is complete and sorted by time before it enters the data buffer. The data is then held in the data buffer according to the corresponding data structure. Once the amount of data in the buffer reaches a preset compression threshold, a read lock is placed on the buffer, and type identification and structural encapsulation are performed on the data. The encapsulated timestamp data flows into the timestamp compression device, and the observation data flows into the observation data compression device. After the data leaves the compression devices, it is stored in the corresponding data buffer to await further processing by upper-layer modules.
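The buffering stage described above can be illustrated with a minimal Python sketch. The class name `StreamBuffer`, the threshold value, and the cleaning policies (dropping incomplete points and non-finite readings) are assumptions made for illustration; the patent text does not specify them, and the read lock is omitted for brevity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class StreamBuffer:
    """Hypothetical buffer for (timestamp_ms, value) points."""
    threshold: int = 4                      # assumed compression threshold
    points: list = field(default_factory=list)

    def add(self, timestamp_ms, value):
        # Missing-value handling (assumed policy: drop incomplete points).
        if timestamp_ms is None or value is None:
            return None
        # Outlier handling (assumed policy: drop non-finite readings).
        if not math.isfinite(value):
            return None
        self.points.append((timestamp_ms, value))
        # Hand the block over once the threshold is reached.
        if len(self.points) >= self.threshold:
            return self.flush()
        return None

    def flush(self):
        # Order by time, then split into one column per data type.
        ordered = sorted(self.points)
        self.points = []
        timestamps = [t for t, _ in ordered]
        values = [v for _, v in ordered]
        return timestamps, values
```

When `add` returns a pair of lists, the first would go to the timestamp compressor and the second to the observation compressor.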
In the timestamp compression of step 2, the timestamp data t0, t1, …, tn form a monotonically increasing sequence. Because the acquisition interval of the data source is usually fixed, the difference between ti-1 and ti fluctuates around this constant value, and the deviation comes mainly from the transmission of the data through the medium. Compared with recording every full timestamp, keeping only the deviation between each timestamp and the theoretical value computed from the constant difference therefore saves a large amount of storage space. Because the time interval of the acquisition device cannot be obtained in every case, the invention fills in the missing constant difference by averaging all of the differences.
As shown in Table 1, the invention first computes the average time difference over the entire cache block and estimates the continuous sampling interval to be 1000 ms. It then computes the difference between each timestamp and its theoretical timestamp. For example, the difference 14 for 1537513051014 is obtained by subtracting the theoretical value (1537513051000) from the actual value (1537513051014). After all differences have been computed in this way, the information that needs to be stored changes from the timestamps in the first column of Table 1 to the sequence 1537513050000, 1000, 14, 2.
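The deviation calculation can be sketched in Python as follows. The function names are hypothetical. The interval estimate rounds the mean consecutive difference, which is an assumption about how the averaging is applied. Note that the three rows of Table 1 alone give a mean of 1001 ms, so the example passes the 1000 ms interval from the text explicitly.

```python
def compress_timestamps(timestamps, interval=None):
    """Return (first timestamp, interval, deviations from the ideal grid)."""
    first = timestamps[0]
    if interval is None:
        # Assumed estimator: mean of consecutive differences, rounded.
        steps = max(len(timestamps) - 1, 1)
        interval = round((timestamps[-1] - first) / steps)
    deviations = [
        t - (first + i * interval)
        for i, t in enumerate(timestamps)
    ][1:]  # the first timestamp is stored in full, so skip its deviation
    return first, interval, deviations


def decompress_timestamps(first, interval, deviations):
    """Rebuild the original timestamps from the stored triple."""
    return [first] + [
        first + (i + 1) * interval + d
        for i, d in enumerate(deviations)
    ]


stamps = [1537513050000, 1537513051014, 1537513052002]
packed = compress_timestamps(stamps, interval=1000)
print(packed)  # (1537513050000, 1000, [14, 2])
```

Because only integer subtraction is involved, `decompress_timestamps(*packed)` returns the original list exactly.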
To further improve the compression and reduce the storage needed for the differences, the invention applies a variable-length coding scheme to the stored deviations, as shown in Fig. 2. The variable-length coding proceeds as follows. Step 1: input the timestamp data and the group length of the coding. Step 2: compute the binary encoding of the timestamp data. Step 3: determine whether the length of the binary encoding is an integer multiple of the group length; if so, split the encoding into groups of that length; if not, pad the front of the encoding with zeros and then split it into groups. Step 4: determine whether the current group is the last group; if so, go to step 5; if not, set the leading flag bit of the group to 0, indicating that another group follows, and repeat step 4. Step 5: set the leading flag bit of the last group to 1. Step 6: output the grouped encoding of the timestamp data.
A specific example is given in Table 1, where the fixed group length L is 3, that is, 3 bits per group. Taking "14" as an example, its binary representation is first computed as 01110 (the first bit is the sign bit). It is then split into groups of 3 bits, with zeros padded at the front where a group has fewer than 3 bits, giving the two storage units 001 and 110. Next, the flag bits are set: the invention marks the last group of a code with "1", giving 0001 and 1110. The purpose of the group flag bit is to mark clearly where each stored value ends, which makes later decompression straightforward.
Time Stamp (ms)    Delta of Delta (ms)    SDD Code
1537513050000      -                      -
1537513051014      14                     0001 1110
1537513052002      2                      1010
Table 1
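The grouping steps and the SDD codes in Table 1 can be reproduced with the following Python sketch. The function names are hypothetical. The sketch handles only non-negative deviations, because the text does not say how the sign bit is recovered once zeros have been padded in front of it; that restriction is an assumption of this example, not part of the described method.

```python
def encode_groups(value, group_len=3):
    """Encode a non-negative deviation as flagged groups of bits."""
    if value < 0:
        raise ValueError("negative deviations are outside this sketch")
    bits = "0" + format(value, "b")              # leading 0 is the sign bit
    padded_len = -(-len(bits) // group_len) * group_len
    bits = bits.zfill(padded_len)                # pad the front with zeros
    groups = [bits[i:i + group_len] for i in range(0, len(bits), group_len)]
    # Flag 0 means another group follows; flag 1 marks the last group.
    return ["0" + g for g in groups[:-1]] + ["1" + groups[-1]]


def decode_groups(stream, group_len=3):
    """Decode one value from the front of a bit string.

    Returns the value and the unread remainder of the stream.
    """
    payload = ""
    pos = 0
    while True:
        flag = stream[pos]
        payload += stream[pos + 1:pos + 1 + group_len]
        pos += 1 + group_len
        if flag == "1":
            break
    return int(payload, 2), stream[pos:]


print(encode_groups(14))  # ['0001', '1110']
print(encode_groups(2))   # ['1010']
```

Concatenating the groups for 14 and 2 gives the stream `000111101010`, and calling `decode_groups` twice on it returns 14 and then 2, which shows how the flag bit delimits each value.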
The final compression ratio of this method is given by a formula (not reproduced in this text) in which len is the storage size of the corresponding number, f is the storage size computed by the timestamp compression algorithm, and l is the fixed size of each coding group.
As data collected by sensors or other devices, the observations that correspond to the timestamps also show a certain continuity. However, the difference between two consecutive samples is not stable, which means that the variance of the differences can become very large. A new lossy compression algorithm is therefore proposed, based on the characteristics of the observation data and the associated computational complexity.
The invention first obtains the binary representation of the observation data and then reduces its storage precision according to the precision required. This is based on the IEEE 754 encoding of 64-bit floating-point numbers. From this encoding it can be seen that the leading part of the 52 mantissa bits is enough to satisfy the precision requirements of most data, and that the further back a bit lies, the smaller its effect on the precision of the whole value, to the point where it can be ignored. The invention therefore zeroes the tail of the encoding, setting to 0 the mantissa bits that have almost no effect on the precision of the value. The number of zeroed bits is given by a formula (not reproduced in this text) in which zero_num is the number of mantissa bits that can be zeroed without affecting the stored precision, n is the precision of the observation data to be retained (for example, n is 6 for the value 60.746888), and l_dot is the number of mantissa bits of the double type, 52 by default.
The invention then compresses the processed binary data further by computing its XOR with the encoding of the previous value, so that only the information difference is stored. Table 2 shows the Loss&Xor part of observation data compression. The Double column is the original binary encoding of the data, LossDouble is the encoding after the tail has been zeroed, and XorValue is the XOR difference from the previous value. When the original value is 0, the invention sets a corresponding flag bit to mark it and sets XorValue to 0xffffffffffffffff as a placeholder. During decompression, such an entry is likewise skipped after 0 has been output and does not enter the XOR stream.
Srcdata      Double                LossDouble            XorValue
60.746888    0x40445fbacb428912    0x40445fbacb000000    -
60.721392    0x40445c70c996b767    0x40445c70c9000000    0x000003ca02000000
60.743936    0x40445f609dcf893f    0x40445f609d000000    0x0000031054000000
60.743936    0x40445f609dcf893f    0x40445f609d000000    0x0
0.000000     -                     -                     0xffffffffffffffff
Table 2
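The Loss&Xor columns of Table 2 can be reproduced from the Double column with the short Python sketch below. The function names are hypothetical, and `zero_num = 24` is inferred from the six zeroed hexadecimal digits in the table rather than computed from the missing formula. The sketch works directly on the 64-bit patterns listed in the table. `double_to_bits` shows how such a pattern could be obtained from a float with the standard `struct` module.

```python
import struct

MASK64 = (1 << 64) - 1
ZERO_PLACEHOLDER = 0xFFFFFFFFFFFFFFFF  # placeholder from Table 2


def double_to_bits(x):
    """Return the IEEE 754 bit pattern of a Python float as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def zero_tail(bits, zero_num):
    """Clear the lowest zero_num bits of a 64-bit pattern."""
    return bits & (MASK64 << zero_num) & MASK64


def loss_and_xor(bit_patterns, zero_num):
    """Return the XorValue column; None stands for the first entry."""
    out = []
    prev = None
    for bits in bit_patterns:
        if bits == 0:
            # Original value is 0: emit the placeholder and leave the
            # XOR stream untouched, as the text describes.
            out.append(ZERO_PLACEHOLDER)
            continue
        lossy = zero_tail(bits, zero_num)
        out.append(None if prev is None else lossy ^ prev)
        prev = lossy
    return out


doubles = [0x40445fbacb428912, 0x40445c70c996b767,
           0x40445f609dcf893f, 0x40445f609dcf893f, 0]
for x in loss_and_xor(doubles, zero_num=24):
    print("-" if x is None else hex(x))
```

The printed values match the XorValue column: the repeated observation yields `0x0`, and the zero observation yields the placeholder.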
After the XOR difference has been computed, the invention processes it with a specially designed variable-length coding scheme, as shown in Fig. 3. The processing steps are as follows. Step 1: define the control-bit code, the start-position code, and the effective-length code. Step 2: determine whether the observation is 0; if so, go to step 6; if not, go to step 3. Step 3: set the first control bit to 1, compute the start-position code from the position of the first 1 in the data encoding, and compute the effective-length code from the start position and the position of the last 1. Step 4: determine whether the start-position code of the observation is the same as that of the previous value; if so, set the second control bit to 1; if not, set it to 0. Step 5: determine whether the effective-length code of the observation is the same as that of the previous value; if so, set the third control bit to 1; if not, set it to 0. Step 6: output the control-bit code Ctlbits, the start-position code Start, and the effective-length code Length.
Table 3 shows the variable-length coding of the observation data compression for a specific example. Ctlbits is the overall flag field: its bits 0 to 2 indicate, respectively, whether the original value is 0, whether the position of the first "1" is the same as in the previous value, and whether the significant-bit length is the same as in the previous value. Start is the position of the first "1" in the XOR result. Length is the length of the significant bit segment, that is, the part from the first "1" to the last "1". Core Part is the significant bit segment itself.
Table 3
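Since the contents of Table 3 are not available in this text, the following Python sketch only illustrates how Ctlbits, Start, Length, and Core Part could be derived from the XorValue entries of Table 2 by following steps 1 to 6. The function name, the dictionary layout, counting Start from the most significant bit, and the handling of an all-zero XOR result are all assumptions of this example.

```python
def describe_xor(xor_value, prev_start=None, prev_length=None, is_zero=False):
    """Derive the control bits and significant segment of a 64-bit XOR value."""
    if is_zero:
        # Step 2: the observation itself is 0, so only control bits are output.
        return {"ctlbits": "000", "start": None, "length": None, "core": ""}
    if xor_value == 0:
        # Assumed convention for a value identical to its predecessor.
        start, length, core = 0, 0, ""
    else:
        # Position of the first 1, counted from the most significant bit.
        start = 64 - xor_value.bit_length()
        # Number of zero bits after the last 1.
        trailing = (xor_value & -xor_value).bit_length() - 1
        length = xor_value.bit_length() - trailing
        core = format(xor_value >> trailing, "b")
    ctlbits = (
        "1"                                          # step 3: non-zero value
        + ("1" if start == prev_start else "0")      # step 4: same start
        + ("1" if length == prev_length else "0")    # step 5: same length
    )
    return {"ctlbits": ctlbits, "start": start, "length": length, "core": core}


first = describe_xor(0x000003ca02000000)
second = describe_xor(0x0000031054000000, first["start"], first["length"])
print(first["ctlbits"], first["start"], first["length"])     # 100 22 17
print(second["ctlbits"], second["start"], second["length"])  # 110 22 16
```

In this example the second value shares its start position with the first, so only its length and core segment would need to be written in addition to the control bits.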
The data file stored on disk is the final form of the persisted data, and how best to save the data must also be considered. Fig. 4 shows the data file encapsulation structure, which has four parts: 1) the file header, which mainly contains the file's identification information, compression algorithm, compression parameters, and data index; 2) the keys of the data, that is, the timestamps, stored as compressed binary data; 3) the values of the data, that is, the observations, stored as compressed binary data; 4) a check value covering all the data, including the file header, which ensures the consistency and reliability of the encapsulated data. Storing the data in the file compressed by column according to data type makes retrieval and querying easier. To ensure the consistency and safety of the data, the invention also designs a corresponding checksum strategy for the data file.
Because the time series data is compressed with variable-length methods, the final data file is also of variable length. In addition, the file header of each individual data file contains segment index information for managing the data.
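A minimal Python sketch of the four-part layout follows. The magic bytes `TSC1`, the header fields, and the choice of CRC-32 as the check value are assumptions made for illustration; the text states only that the file holds a header, a timestamp column, a value column, and a check value over all of the data.

```python
import struct
import zlib

MAGIC = b"TSC1"                      # hypothetical file identifier
# Header: magic, algorithm id, group length, timestamp bytes, value bytes.
HEADER = struct.Struct(">4sBBII")


def pack_file(ts_bytes, value_bytes, algorithm_id=1, group_len=3):
    """Assemble header, timestamp column, value column, and check value."""
    header = HEADER.pack(MAGIC, algorithm_id, group_len,
                         len(ts_bytes), len(value_bytes))
    body = header + ts_bytes + value_bytes
    # The check value covers the header and both columns.
    return body + struct.pack(">I", zlib.crc32(body))


def unpack_file(blob):
    """Verify the check value and return the parameters and columns."""
    body = blob[:-4]
    (stored_crc,) = struct.unpack(">I", blob[-4:])
    if zlib.crc32(body) != stored_crc:
        raise ValueError("check value mismatch")
    magic, algorithm_id, group_len, ts_len, val_len = HEADER.unpack_from(body)
    if magic != MAGIC:
        raise ValueError("unknown file identifier")
    ts_start = HEADER.size
    val_start = ts_start + ts_len
    ts_bytes = body[ts_start:val_start]
    value_bytes = body[val_start:val_start + val_len]
    return algorithm_id, group_len, ts_bytes, value_bytes
```

Because the header records the length of each column, a reader can jump straight to the value column without decoding the timestamps, which is in the spirit of the index information mentioned above.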
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A data compression method for streaming time series data, characterized by comprising the following steps: step 1, data cleaning, the data cleaning including missing-value handling and outlier handling of the data, followed by type identification of the data to obtain timestamp data and observation data; step 2, data compression, in which the encapsulated timestamp data undergoes timestamp compression and the observation data undergoes observation data compression; step 3, variable-length coding of the compressed timestamp data and the compressed observation data; step 4, data encapsulation, the encapsulation being the storage of the data in a data file, compressed by column according to data type.
2. The method of claim 1, characterized in that, in step 2, the timestamp data is compressed by obtaining the average time difference of the entire cache block, then obtaining the difference between each timestamp and the theoretical timestamp, and then summing and averaging the obtained differences.
3. The method of claim 2, characterized in that, in step 2, the observation data is compressed by obtaining the binary representation of the observation data, zeroing the tail of the binary encoding according to the required precision, and then computing its XOR with the encoding of the previous value.
4. The method of claim 3, characterized in that the tail zeroing is performed with a number of zeroed mantissa bits given by a formula (not reproduced in this text), in which values is the numerical value of the observation data, n is the precision of the observation data to be retained, and l_dou is the number of mantissa bits of the double type.
5. The method of claim 4, wherein the variable-length coding is performed as follows: step 1, input the timestamp data and the group length of the coding; step 2, compute the binary encoding of the timestamp data; step 3, determine whether the length of the binary encoding is an integer multiple of the group length, and if so, split the encoding into groups of that length, and if not, pad the front of the encoding with zeros and then split it into groups; step 4, determine whether the current group is the last group, and if so, go to step 5, and if not, set the leading flag bit of the group to 0, indicating that another group follows, and repeat step 4; step 5, set the leading flag bit of the last group to 1; step 6, output the grouped encoding of the timestamp data.
6. The method of claim 5, wherein the encapsulation structure has four parts: the first part is the file header, which mainly contains the file's identification information, compression algorithm, compression parameters, and data index; the second part contains the timestamps of the data as compressed binary data; the third part contains the observations of the data as compressed binary data; and the fourth part contains the check value of all the data.
CN201910112962.9A 2019-02-13 2019-02-13 A data compression method for streaming time series data Pending CN109871362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910112962.9A CN109871362A (en) 2019-02-13 2019-02-13 A data compression method for streaming time series data

Publications (1)

Publication Number Publication Date
CN109871362A true CN109871362A (en) 2019-06-11

Family

ID=66918700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910112962.9A Pending CN109871362A (en) 2019-02-13 2019-02-13 A data compression method for streaming time series data

Country Status (1)

Country Link
CN (1) CN109871362A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665213A (en) * 2016-07-29 2018-02-06 罗晓燕 A kind of power equipment online data processing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN LI等: "FluteDB: An efficient and scalable in-memory time series database for sensor-cloud", 《JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112152631B (en) * 2019-06-28 2023-08-15 杭州海康威视数字技术股份有限公司 Method and device for encoding and decoding variable-length time string
CN112152631A (en) * 2019-06-28 2020-12-29 杭州海康威视数字技术股份有限公司 Method and device for coding and decoding variable-length time string
CN110504974A (en) * 2019-08-20 2019-11-26 北京四方继保自动化股份有限公司 D-PMU measurement data segmentation slice mixing compression and storage method and device
CN110504974B (en) * 2019-08-20 2023-10-27 北京四方继保自动化股份有限公司 D-PMU measurement data segmented slice hybrid compression storage method and device
CN110688362A (en) * 2019-08-27 2020-01-14 浙江浙大中控信息技术有限公司 Data sectional type storage method based on time stamp
CN110943797B (en) * 2019-12-18 2021-06-22 北京邮电大学 Data compression method in SDH network
CN111181569A (en) * 2019-12-31 2020-05-19 山东信通电子股份有限公司 Compression method, device and equipment of time sequence data
CN111181569B (en) * 2019-12-31 2021-06-15 山东信通电子股份有限公司 Compression method, device and equipment of time sequence data
CN111597225A (en) * 2020-04-21 2020-08-28 杭州安脉盛智能技术有限公司 Adaptive data reduction method based on segmented transient recognition
CN111597225B (en) * 2020-04-21 2023-10-27 杭州安脉盛智能技术有限公司 Self-adaptive data reduction method based on segmentation transient identification
WO2022001626A1 (en) * 2020-06-30 2022-01-06 华为技术有限公司 Time series data injection method, time series data query method and database system
CN115391355A (en) * 2022-10-26 2022-11-25 本原数据(北京)信息技术有限公司 Data processing method, device, equipment and storage medium
CN115391355B (en) * 2022-10-26 2023-01-17 本原数据(北京)信息技术有限公司 Data processing method, device, equipment and storage medium
CN116069743A (en) * 2023-03-06 2023-05-05 齐鲁工业大学(山东省科学院) Fluid data compression method based on time sequence characteristics
CN116594572A (en) * 2023-07-17 2023-08-15 北京四维纵横数据技术有限公司 Floating point number stream data compression method, device, computer equipment and medium
CN116594572B (en) * 2023-07-17 2023-09-19 北京四维纵横数据技术有限公司 Floating point number stream data compression method, device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190611