CN114665884B

CN114665884B - Time sequence database self-adaptive lossy compression method, system and medium

Info

Publication number: CN114665884B
Application number: CN202210318623.8A
Authority: CN
Inventors: 王宏志; 姜楠; 郑博; 梁栋; 叶天生; 燕钰; 丁小欧
Original assignee: Beijing Nosi Spacetime Technology Co ltd; Harbin Institute of Technology
Current assignee: Beijing Nosi Spacetime Technology Co ltd; Harbin Institute of Technology
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-11-25
Anticipated expiration: 2042-03-29
Also published as: CN114665884A

Abstract

A time-series database adaptive lossy compression method, system and medium, relating to the technical field of computers and addressing the lack, in the prior art, of a method for improving the data compression ratio. The method and system adapt to the user's compression precision requirement: the user sets the compression precision, and the base of each data segment is stored together with part of its deviations so that the compression error stays within that precision. The data compression ratio of the database is high and storage space is saved, because lossy compression trades precision for space by keeping the bases and discarding part of the deviations. Encoding with an idea similar to Huffman coding further improves the compression ratio. The encoding mode is also flexible: Huffman-coded data can be queried only after a whole segment is fully decompressed, so when query efficiency is low, a different encoding mode can be selected to improve it.

Description

Time sequence database self-adaptive lossy compression method, system and medium

Technical Field

The invention relates to the technical field of computers, and in particular to a time-series database adaptive lossy compression method, system and medium.

Background

In recent years, driven by artificial intelligence, 5G, AIoT and other technologies, the amount of global data has grown rapidly. The total amount of global data was 33 ZB in 2018 and reached about 45 ZB in 2019. If this trend continues, 175 ZB of data will be generated per year by 2025. By 2020, 500 billion devices worldwide are sending data to the cloud, covering many practical scenarios such as smart living, smart cities and smart agriculture. With the rapid growth of intelligent machines, Internet of Things devices and sensors that collect and transmit large amounts of measurement data, it becomes critical for data storage to consider data compression strategies. In many cases, a time series consists of high-frequency floating-point data. This data typically contains measurement noise, so lossy compression can provide significantly better compression without adversely affecting downstream applications. In some cases, lossy compression may even improve the performance of downstream applications because it implicitly denoises the data. Meanwhile, time-series data in a time-series database is large in volume and loses timeliness as time passes, so the value of data generated early becomes lower and lower and the demand on its compression ratio becomes higher and higher.

Data compression studies have traditionally focused on increasing the compression ratio, with the goal of bringing it close to the Shannon entropy limit. Compression schemes that reach or approach the entropy limit are well known, the most typical examples being Huffman codes and general Lempel-Ziv (LZ) codes. However, these codes do not suit all scenarios. Importantly, they must decompress all of the compressed data to retrieve a single bit, which makes them unsuitable for modern data storage systems, particularly time-series databases, where a compression scheme is needed that stores the data in a manner that facilitates such retrieval. Therefore, lossy compression techniques for time-series databases are becoming increasingly necessary.

In summary, time-series data in a time-series database has low precision requirements and a high tolerance for error in many industrial scenarios. Meanwhile, data generated earlier is queried less often, so decompression speed matters less in practice, while improving the data compression ratio and saving storage space matters more and more. This creates a need for lossy compression techniques, particularly ones that can adapt to error requirements.

Disclosure of Invention

The purpose of the invention is to provide a time-series database adaptive lossy compression method, system and medium, aiming at the problem that the prior art lacks a method for improving the data compression ratio.

The technical scheme adopted by the invention for solving the technical problems is as follows:

The time-series database adaptive lossy compression method comprises the following steps:

step one: acquiring time-series data to be compressed, and dividing it into different data blocks, wherein the data in each data block does not repeat the data in any other data block;

step two: deleting, in each data block, the data point positions that fall outside the precision requirement;

step three: for the data blocks processed in step two, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally calculating the storage space required by all bases and deviations;

step four: letting i = i + 1 and executing step three again as an iteration; if the storage space currently required is larger than that required in the previous iteration, stopping the iteration and taking the base and deviation obtained in the previous iteration as the final base and deviation, wherein i represents the number of data point positions with the minimum correlation in each data block;

step five: storing the final base and deviation.
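
By way of illustration only, steps three and four can be sketched as follows. The sketch assumes that each data block has already been reduced to an unsigned integer of `block_bits` bits, that the positions with the minimum correlation are the `i` low-order bits, and that each block refers to its base through a fixed-length ID. These assumptions, and the names `storage_bits` and `choose_split`, are hypothetical and are not taken from the embodiments.

```python
from math import ceil, log2

def storage_bits(blocks, block_bits, i):
    """Estimated bits needed when the i low-order bits of each block are the deviation."""
    bases = {b >> i for b in blocks}              # identical bases are stored only once
    id_bits = max(1, ceil(log2(len(bases))))      # assumed fixed-length base ID
    dictionary = len(bases) * (block_bits - i)    # cost of the deduplicated bases
    per_block = len(blocks) * (id_bits + i)       # each block keeps an ID and a deviation
    return dictionary + per_block

def choose_split(blocks, block_bits):
    """Increase i until the estimate grows; return the previous (i, cost)."""
    best_i, best_cost = 1, storage_bits(blocks, block_bits, 1)
    for i in range(2, block_bits):
        cost = storage_bits(blocks, block_bits, i)
        if cost > best_cost:                      # stopping rule of step four
            break
        best_i, best_cost = i, cost
    return best_i, best_cost
```

For eight 8-bit blocks that share two distinct high-order patterns, e.g. `choose_split([176, 177, 178, 179, 80, 81, 82, 83], 8)`, the estimate falls from 52 bits at i = 1 to 36 bits at i = 2 and rises to 42 bits at i = 3, so the sketch stops and returns `(2, 36)`.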

Further, the specific steps of step five are as follows:

step five-one: for the final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording its ID in the base dictionary; if not, retaining the base;

step five-two: for the bases retained in step five-one, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;

step five-three: sorting the retained bases in ascending order of usage count, and encoding them in that order from long ID codes to short ID codes, so that the least-used bases receive the longest IDs and the most-used bases the shortest;

step five-four: updating the base dictionary with the bases encoded in step five-three and their corresponding IDs;

step five-five: storing the IDs and the deviations.
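
The frequency-based ID assignment can be illustrated with a standard Huffman construction. This is only one possible realisation: the description states that the idea is similar to Huffman coding without fixing the exact code, and the function name `assign_base_ids` is hypothetical.

```python
import heapq
from collections import Counter

def assign_base_ids(bases):
    """Map each distinct base to a prefix-free bit string; frequent bases get shorter IDs."""
    counts = Counter(bases)
    if len(counts) == 1:                          # a single base still needs a one-bit ID
        return {next(iter(counts)): "0"}
    # Heap entries are (usage count, tie breaker, {base: code built so far}).
    heap = [(n, k, {b: ""}) for k, (b, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)         # the two least-used subtrees
        n1, _, right = heapq.heappop(heap)
        merged = {b: "0" + c for b, c in left.items()}
        merged.update({b: "1" + c for b, c in right.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]
```

With usage counts of 5, 2, 1 and 1 for four bases, the most-used base receives a 1-bit ID, the next a 2-bit ID and the two rarest 3-bit IDs, which matches the ordering from long ID codes to short ID codes described above.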

Further, the specific step of dividing the time-series data to be compressed into different data blocks in step one is as follows:

the time-series data to be compressed is divided iteratively into non-repeating data blocks, with an error of 0.1%.

Further, the specific step of step five is as follows:

entropy coding is performed on the final base and deviation, and the entropy-coded data is stored.

Further, the entropy coding is Shannon coding, Huffman coding or arithmetic coding.

Further, the entropy coding is Huffman coding.

A time-series database adaptive lossy compression system, comprising: a data acquisition module, a data cutting module, a data processing module and a storage module;

the data acquisition module is used for acquiring time-series data to be compressed;

the data cutting module is used for cutting the time-series data to be compressed into different data blocks, wherein the data in each data block does not repeat the data in any other data block;

the data processing module is used for determining the base and deviation of each segmented data block, with the following specific steps:

taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally calculating the storage space required by all bases and deviations;

letting i = i + 1 and iterating the above step; if the storage space currently required is larger than that required in the previous iteration, stopping the iteration and taking the base and deviation obtained in the previous iteration as the final base and deviation, wherein i represents the number of data point positions with the minimum correlation in each data block;

the storage module stores the determined base and deviation.

Further, the storage module specifically executes the following steps:

step 1: for the final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording its ID in the base dictionary; if not, retaining the base;

step 2: for the bases retained in step 1, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;

step 3: encoding bases with higher usage counts with short IDs, and bases with lower usage counts with long IDs;

step 4: updating the base dictionary with the bases encoded in step 3 and their corresponding IDs;

step 5: storing the IDs and the deviations.

A computer-readable storage medium storing a time-series database adaptive lossy compression program, wherein, when the program is executed by a computer, the following steps are performed:

step A: acquiring time-series data to be compressed, and dividing it into different data blocks, wherein the data in each data block does not repeat the data in any other data block;

step B: deleting, in each data block, the data point positions that fall outside the precision requirement;

step C: for the data blocks processed in step B, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally calculating the storage space required by all bases and deviations;

step D: letting i = i + 1 and executing step C again as an iteration; if the storage space currently required is larger than that required in the previous iteration, stopping the iteration and taking the base and deviation obtained in the previous iteration as the final base and deviation, wherein i represents the number of data point positions with the minimum correlation in each data block;

step E: for the final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording its ID in the base dictionary; if not, retaining the base;

step F: for the bases retained in step E, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;

step G: encoding bases with higher usage counts with short IDs, and bases with lower usage counts with long IDs;

step H: updating the base dictionary with the bases encoded in step G and their corresponding IDs;

step I: storing the IDs and the deviations.

The beneficial effects of the invention are as follows:

The invention adapts to the user's compression precision requirement. The user sets the compression precision, and the base of each data segment is stored together with part of its deviations so that the compression error stays within that precision.

The data compression ratio of the database is high and storage space is saved. Lossy compression trades precision for space: the bases are kept and part of the deviations is discarded. Encoding with an idea similar to Huffman coding further improves the compression ratio.

The encoding mode is flexible. Huffman-coded data can be queried only after a whole segment is fully decompressed, so when query efficiency is low, a different encoding mode can be selected to improve it.

Drawings

FIG. 1 is a flow chart of the present application;

FIG. 2 is a schematic diagram of the base and deviation of a data segment;

FIG. 3 is a schematic diagram of a time-series data partitioning process.

Detailed Description

It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.

The first embodiment: this embodiment is described with reference to FIG. 1. The time-series database adaptive lossy compression method of this embodiment comprises the following steps:

step two: deleting, in each data block, the data point positions that fall outside the precision requirement;

step five: storing the final base and deviation.

The lossy compression process for time-series data is similar to classic data deduplication. When a user uploads a file, the file is divided into a number of blocks. The blocks are not stored immediately; instead, a transformation function is applied to each block to obtain its base and deviation. According to the adaptive error requirement, low-order deviations are discarded as appropriate, giving a storage scheme with a high compression ratio. The database applies deduplication to the bases so that each base is stored only once, and the required deviation is stored next to the ID of its base, which makes it easy to recover the original block.

(1) Data block partitioning

The user inputs the required compression precision. The data is then divided into different data blocks, the base (the part the data has in common) in each data block is determined, and maximized, by an iterative method (as shown in FIG. 3), and the deviation, i.e. the part other than the base, is retained to the extent the precision requirement demands. The base and deviation are shown in FIG. 2.
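
One way to picture the "same part" of a data block is sketched below. It assumes that the samples are unsigned integers of `sample_bits` bits and that the base is the run of leading bits common to every sample in the block. The iterative method of FIG. 3 is not reproduced, and the names `common_prefix_bits` and `split_block` are hypothetical.

```python
def common_prefix_bits(block, sample_bits):
    """Count the leading bits that every sample in the block shares."""
    diff = 0
    for value in block:
        diff |= value ^ block[0]          # mark every bit position where samples disagree
    return sample_bits - diff.bit_length()

def split_block(block, sample_bits):
    """Return (base, deviations): the shared leading bits and each sample's remainder."""
    dev_bits = sample_bits - common_prefix_bits(block, sample_bits)
    base = block[0] >> dev_bits
    deviations = [value & ((1 << dev_bits) - 1) for value in block]
    return base, deviations
```

For the 8-bit samples `0b10110001`, `0b10110110` and `0b10110100`, the first five bits agree, so under these assumptions the base is `0b10110` and the deviations are the three low-order bits of each sample.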

(2) Coding method

All data is traversed to count the number of times each base is used. The frequency of each base gives an estimate of the distribution of bases, which is used to define an entropy coding scheme for the base IDs: commonly used bases are encoded with fewer bits. Similar to the idea of Huffman coding, this can further improve the overall compression because less storage space is needed for the base IDs. However, it affects the overall compression speed, so this step should be used only when the overall compression ratio is more important than query performance. In general, this step improves compression most when the usage distribution of the bases is far from uniform, since heavily used bases receive short IDs and rarely used bases receive long IDs.
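
Whether entropy-coded IDs are worthwhile can be estimated before encoding by comparing a fixed-length ID with the Shannon bound computed from the usage counts. The following is a minimal sketch under that assumption; the function name `id_cost_bits` is hypothetical.

```python
from collections import Counter
from math import ceil, log2

def id_cost_bits(bases):
    """Return (fixed_bits, entropy_bits) for the base IDs of a sequence of blocks.

    fixed_bits   : every block carries an ID of ceil(log2(number of distinct bases)) bits
    entropy_bits : Shannon lower bound for any code built from the usage counts
    """
    counts = Counter(bases)
    n = len(bases)
    fixed_bits = n * max(1, ceil(log2(len(counts))))
    entropy_bits = sum(c * log2(n / c) for c in counts.values())
    return fixed_bits, entropy_bits
```

For sixteen blocks spread evenly over four bases, both figures are 32 bits, so entropy-coded IDs gain nothing. When thirteen of the sixteen blocks use the same base, the bound drops to about 15.9 bits against 32 bits for fixed-length IDs, which illustrates why the step pays off mainly for non-uniform usage.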

(3) Data compression

The time-series data is cut into blocks; if it cannot be divided into a whole number of blocks, zero padding is used to complete the last block. For each block, the base and deviation are first identified, and then the ID of that base is looked up. If the base does not yet exist, it is added to the base dictionary and assigned the next available integer. Once the ID is determined and the dictionary is updated, the next block is processed, until all the data is compressed.
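
The block loop described above can be sketched as follows, assuming unsigned integer samples, a block of `block_len` samples, and a deviation made of the `dev_bits` low-order bits of each sample. These parameters and the names `compress` and `decompress` are hypothetical. The sketch keeps every deviation bit, so it shows only the dictionary mechanism and not the lossy discarding of deviations.

```python
def compress(samples, block_len, dev_bits):
    """Cut samples into blocks, zero-pad the tail, and store (base ID, deviations) per block."""
    padded = list(samples) + [0] * (-len(samples) % block_len)
    mask = (1 << dev_bits) - 1
    dictionary, encoded = {}, []
    for start in range(0, len(padded), block_len):
        block = padded[start:start + block_len]
        base = tuple(v >> dev_bits for v in block)
        devs = tuple(v & mask for v in block)
        # A new base receives the next available integer as its ID.
        base_id = dictionary.setdefault(base, len(dictionary))
        encoded.append((base_id, devs))
    return dictionary, encoded

def decompress(dictionary, encoded, dev_bits, n_samples):
    """Rebuild the samples from base IDs and deviations, dropping the zero padding."""
    by_id = {i: base for base, i in dictionary.items()}
    out = []
    for base_id, devs in encoded:
        out.extend((b << dev_bits) | d for b, d in zip(by_id[base_id], devs))
    return out[:n_samples]
```

With `samples = [16, 17, 18, 19, 16, 18, 33]`, `block_len = 2` and `dev_bits = 2`, the first three blocks share one base and the padded last block introduces a second one, so the dictionary holds two entries for four blocks.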

The present application compresses data adaptively according to the precision requirement, controlling the degree of loss according to the user and the actual need and so balancing user requirements against database storage. By storing the base and deviation of the data, the data is compressed to a large extent and the compression ratio is improved while the query efficiency of the database is maintained. In addition, an idea similar to Huffman coding is used to secure the compression ratio and reduce storage space.

The second embodiment: this embodiment further describes the first embodiment and differs from it in that the specific steps of step five are as follows:

step five-three: sorting the retained bases in ascending order of usage count, and encoding them in that order from long ID codes to short ID codes, so that the least-used bases receive the longest IDs and the most-used bases the shortest;

step five-four: updating the base dictionary with the bases encoded in step five-three and their corresponding IDs;

step five-five: storing the IDs and the deviations.

The third embodiment: this embodiment further describes the first embodiment and differs from it in the specific step of dividing the time-series data to be compressed into different data blocks in step one, which is as follows:

The fourth embodiment: this embodiment further describes the first embodiment and differs from it in that the specific step of step five is as follows:

The fifth embodiment: this embodiment further describes the fourth embodiment and differs from it in that the entropy coding is Shannon coding, Huffman coding or arithmetic coding.

The sixth embodiment: this embodiment further describes the fifth embodiment and differs from it in that the entropy coding is Huffman coding.

The seventh embodiment: this embodiment discloses a time-series database adaptive lossy compression system, comprising: a data acquisition module, a data cutting module, a data processing module and a storage module;

the storage module stores the determined base and deviation.

The eighth embodiment: this embodiment further describes the seventh embodiment and differs from it in that the storage module specifically executes the following steps:

step 1: for the final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording its ID in the base dictionary; if not, retaining the base;

step 2: for the bases retained in step 1, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;

step 5: storing the IDs and the deviations.

The ninth embodiment: this embodiment discloses a computer-readable storage medium storing a time-series database adaptive lossy compression program, wherein, when the program is executed by a computer, the following steps are performed:

step B: deleting, in each data block, the data point positions that fall outside the precision requirement;

step C: for the data blocks processed in step B, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally calculating the storage space required by all bases and deviations;

step D: letting i = i + 1 and executing step C again as an iteration; if the storage space currently required is larger than that required in the previous iteration, stopping the iteration and taking the base and deviation obtained in the previous iteration as the final base and deviation, wherein i represents the number of data point positions with the minimum correlation in each data block;

step E: for the final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording its ID in the base dictionary; if not, retaining the base;

step G: encoding bases with higher usage counts with short IDs, and bases with lower usage counts with long IDs;

step I: storing the IDs and the deviations.

Example:

(1) Configuration process of the adaptive lossy data compression algorithm

Step one: according to the characteristics of the data or the user's own needs, the user specifies a desired data compression ratio within a reasonable range, and the range of data to which compression applies, for example certain data from a certain period of time.

Step two: based on the parameters supplied by the user, a decision is made on how many deviations to retain; a certain number of deviations is selected so as to achieve the effect the user expects.
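
The retention decision of step two can be illustrated for the simple case in which the user's parameter has already been translated into a maximum absolute error on unsigned integer samples. How a target compression ratio is mapped to such a tolerance is not described here and is not attempted. The names `droppable_bits` and `truncate` are hypothetical.

```python
def droppable_bits(max_abs_error, dev_bits):
    """Largest number k of low-order deviation bits that may be discarded.

    Clearing the k lowest bits changes a sample by at most 2**k - 1,
    so k is increased while that bound stays within the tolerance.
    """
    k = 0
    while k < dev_bits and (1 << (k + 1)) - 1 <= max_abs_error:
        k += 1
    return k

def truncate(value, k):
    """Discard the k low-order bits of a sample (they are replaced by zeros)."""
    return (value >> k) << k
```

Under these assumptions a tolerance of 7 allows three deviation bits to be dropped, a tolerance of 0 allows none, and the result never exceeds the number of deviation bits available.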

(2) Lossy data compression algorithm application process

Step one: divide the data into segments. How many samples to concatenate depends on how strongly adjacent samples are correlated: when samples are concatenated, the base costs fewer bits per sample than when they are not, which reduces the overhead of representing the correlation. At the same time, concatenating too many samples lowers the probability that bases match, because the segments must agree in more bits, so there are likely to be fewer matches overall. Therefore, an iterative method is used to divide the data of the whole file into non-repeating data segments, ensuring that each data segment is unique (a corresponding error may exist).

Step two: determine the base and deviation. A transformation is applied to each data segment to obtain its base and deviation; if several blocks have the same base, the duplicate bases are deleted and only the deviations are kept. The number of deviations required is analysed according to the user's requirements, and, where the precision range allows, part of the deviation is discarded and the rest is stored.

Step three: encode the data. Using a method similar to Huffman coding, the bases are traversed to find the number of times each base is used, and the data is encoded and stored based on those counts.
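
The trade-off in step one between concatenating more samples and matching fewer bases can be explored with a rough storage estimate. The sketch below assumes unsigned integer samples of `sample_bits` bits, a deviation of `dev_bits` low-order bits per sample and fixed-length base IDs. It tries a list of candidate segment lengths rather than reproducing the iterative method, and the names `estimate_bits` and `best_block_len` are hypothetical.

```python
from math import ceil, log2

def estimate_bits(samples, block_len, sample_bits, dev_bits):
    """Estimated total bits when block_len consecutive samples form one segment."""
    padded = list(samples) + [0] * (-len(samples) % block_len)
    blocks = [tuple(v >> dev_bits for v in padded[s:s + block_len])
              for s in range(0, len(padded), block_len)]
    distinct = set(blocks)                                   # deduplicated bases
    id_bits = max(1, ceil(log2(len(distinct))))
    dict_bits = len(distinct) * block_len * (sample_bits - dev_bits)
    block_bits = len(blocks) * (id_bits + block_len * dev_bits)
    return dict_bits + block_bits

def best_block_len(samples, candidates, sample_bits, dev_bits):
    """Pick the candidate segment length with the smallest estimate."""
    return min(candidates,
               key=lambda n: estimate_bits(samples, n, sample_bits, dev_bits))
```

With these assumptions, 32 samples cycling through 40, 41, 42 and 43 (8-bit samples, 2 deviation bits) cost an estimated 102, 92, 96 and 116 bits for segment lengths 1, 2, 4 and 8, so a length of 2 would be chosen: longer segments save on IDs at first, then the larger dictionary entries outweigh the saving.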

It should be noted that the detailed description only explains and illustrates the technical solution of the present invention and does not limit the scope of protection of the claims. All modifications and variations that fall within the spirit and scope of the invention are intended to be limited only by the claims and the description.

Claims

1. A time-series database adaptive lossy compression method, characterized by comprising the following steps:

step five: storing the final base and deviation;

the specific steps of step five are as follows:

step five-two: for the bases retained in step five-one, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;

step five-four: updating the base dictionary with the bases encoded in step five-three and their corresponding IDs;

step five-five: storing the IDs and the deviations.

2. The adaptive lossy compression method for time-series databases of claim 1, wherein the step one of dividing the time-series data to be compressed into different data blocks comprises the following specific steps:

3. The adaptive lossy compression method for time series databases according to claim 1, wherein the concrete steps of the fifth step are:

4. The method of adaptive lossy compression for a time series database according to claim 3, wherein the entropy coding is Shannon coding, Huffman coding or arithmetic coding.

5. The method of adaptive lossy compression for a time series database of claim 4, wherein the entropy coding is Huffman coding.

6. A time-series database adaptive lossy compression system, characterized by comprising: a data acquisition module, a data cutting module, a data processing module and a storage module;

taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally calculating the storage space required by all bases and deviations;

letting i = i + 1 and iterating the above step; if the storage space currently required is larger than that required in the previous iteration, stopping the iteration and taking the base and deviation obtained in the previous iteration as the final base and deviation, wherein i represents the number of data point positions with the minimum correlation in each data block;

the storage module stores the determined base and deviation;

the storage module specifically executes the following steps:

step 2: for the bases retained in step 1, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;

step 3: encoding bases with higher usage counts with short IDs, and bases with lower usage counts with long IDs;

step 5: storing the IDs and the deviations.

7. A computer-readable storage medium, wherein the medium stores a time-series database adaptive lossy compression program, and when the time-series database adaptive lossy compression program is executed by a computer, the following steps are specifically executed:

step C: for the data blocks processed in step B, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally calculating the storage space required by all bases and deviations;

step I: storing the IDs and the deviations.