CN114665884B - Time sequence database self-adaptive lossy compression method, system and medium - Google Patents

Publication number
CN114665884B
Authority
CN
China
Prior art keywords
base
data
bases
deviation
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210318623.8A
Other languages
Chinese (zh)
Other versions
CN114665884A (en
Inventor
王宏志
姜楠
郑博
梁栋
叶天生
燕钰
丁小欧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nosi Spacetime Technology Co ltd
Harbin Institute of Technology
Original Assignee
Beijing Nosi Spacetime Technology Co ltd
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nosi Spacetime Technology Co ltd, Harbin Institute of Technology filed Critical Beijing Nosi Spacetime Technology Co ltd
Priority to CN202210318623.8A priority Critical patent/CN114665884B/en
Publication of CN114665884A publication Critical patent/CN114665884A/en
Application granted granted Critical
Publication of CN114665884B publication Critical patent/CN114665884B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059 Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code

Abstract

An adaptive lossy compression method, system and medium for a time-series database, relating to the field of computer technology and addressing the lack of methods in the prior art for improving the data compression ratio. The method adapts to the user's compression precision requirement: the user sets the compression precision, and the base and part of the deviations of each data segment are stored so that the compressed data stay within that precision. The database achieves a high compression ratio and saves storage space, because lossy compression lowers precision by keeping the bases and discarding part of the deviations. Encoding with an idea similar to Huffman coding further improves the compression ratio. The encoding scheme is flexible and can be changed: Huffman-coded data can be queried only after a whole segment is fully decompressed, so when query efficiency is low, a different encoding scheme can be selected to improve it.

Description

Time sequence database self-adaptive lossy compression method, system and medium
Technical Field
The invention relates to the field of computer technology, in particular to an adaptive lossy compression method, system and medium for a time-series database.
Background
In recent years, driven by artificial intelligence, 5G, AIoT and related technologies, the global volume of data has grown without pause. It totaled 33 ZB in 2018 and reached about 45 ZB in 2019; at this rate, 175 ZB of data will be generated per year by 2025. By 2020, 500 billion devices worldwide were uploading data to the cloud, covering practical scenarios such as smart living, smart cities and smart agriculture. As intelligent machines, Internet of Things devices and sensors collect and transmit ever larger amounts of measurement data, a data compression strategy becomes critical for data storage. In many cases a time series consists of high-frequency floating-point data. Such data usually contain measurement noise, so lossy compression can achieve significantly better compression without harming downstream applications; in some cases it even improves their performance, because it implicitly denoises the data. Moreover, time-series data in a time-series database are large in volume and lose timeliness as time passes, so data generated early become less and less valuable and the demand for a high compression ratio keeps growing. Data compression research has traditionally focused on increasing the compression ratio, with the goal of approaching the Shannon entropy limit. Compression schemes that reach or approach this limit are well known, the most typical examples being Huffman codes and the general-purpose Lempel-Ziv (LZ) codes. However, these codes do not suit every scenario. Importantly, they must decompress all of the compressed data to retrieve a single bit, which makes them ill suited to modern data storage systems, particularly time-series databases, which need a compression scheme that stores data in a form that supports such access. Lossy compression techniques are therefore becoming increasingly necessary in time-series databases.
In summary, in many industrial scenarios the time-series data in a time-series database have low precision requirements and a high tolerance for error. At the same time, data generated long ago are queried less often, so decompression speed matters less in practice, while improving the compression ratio and saving storage space matter more and more. This situation creates a need for lossy compression techniques, particularly ones that adapt to the error requirement.
Disclosure of Invention
The purpose of the invention is to provide an adaptive lossy compression method, system and medium for a time-series database, addressing the lack of methods in the prior art for improving the data compression ratio.
The technical solution adopted by the invention to solve the above technical problem is as follows:
the adaptive lossy compression method for a time-series database comprises the following steps:
step one: acquiring the time-series data to be compressed and dividing it into different data blocks, wherein the data in each data block do not repeat the data in any other data block;
step two: deleting, according to the precision requirement, the data point positions in each data block that fall outside the precision requirement;
step three: for the data blocks processed in step two, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, they share one base and the duplicate bases are deleted; finally, calculating the storage space required for all bases and deviations;
step four: setting i = i + 1 and executing step three again to iterate; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the previously obtained bases and deviations as the final bases and deviations, wherein i denotes the number of data point positions with the minimum correlation in each data block;
step five: storing the final bases and deviations.
Further, step five specifically comprises:
step 5.1: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists, recording its ID in the base dictionary; if not, retaining the base;
step 5.2: for the bases retained in step 5.1, traversing the time-series data to be compressed to obtain the number of times each retained base is used;
step 5.3: sorting the retained bases in ascending order of use count, and encoding the sorted bases in order from long ID codes to short ID codes;
step 5.4: updating the base dictionary with the bases encoded in step 5.3 and their corresponding IDs;
step 5.5: storing the IDs and the deviations.
Further, dividing the time-series data to be compressed into different data blocks in step one specifically comprises:
dividing the time-series data to be compressed iteratively into non-repeating data blocks, with an error of 0.1%.
Further, step five specifically comprises:
entropy coding the final bases and deviations, and storing the entropy-coded data.
Further, the entropy coding is Shannon coding, Huffman coding or arithmetic coding.
Further, the entropy coding is Huffman coding.
An adaptive lossy compression system for a time-series database comprises: a data acquisition module, a data cutting module, a data processing module and a storage module;
the data acquisition module is used for acquiring the time-series data to be compressed;
the data cutting module is used for cutting the time-series data to be compressed into different data blocks, wherein the data in each data block do not repeat the data in any other data block;
the data processing module is used for determining the base and deviation of each segmented data block, specifically as follows:
taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, they share one base and the duplicate bases are deleted; finally, calculating the storage space required for all bases and deviations;
setting i = i + 1 and iterating on the basis of the above step; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the previously obtained bases and deviations as the final bases and deviations, wherein i denotes the number of data point positions with the minimum correlation in each data block;
the storage module stores the determined bases and deviations.
Further, the storage module specifically executes the following steps:
step 1: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists, recording its ID in the base dictionary; if not, retaining the base;
step 2: for the bases retained in step 1, traversing the time-series data to be compressed to obtain the number of times each retained base is used;
step 3: assigning short ID codes to bases with more uses and long ID codes to bases with fewer uses;
step 4: updating the base dictionary with the bases encoded in step 3 and their corresponding IDs;
step 5: storing the IDs and the deviations.
A computer-readable storage medium stores an adaptive lossy compression program for a time-series database, wherein the program, when executed by a computer, performs the following steps:
step A: acquiring the time-series data to be compressed and dividing it into different data blocks, wherein the data in each data block do not repeat the data in any other data block;
step B: deleting, according to the precision requirement, the data point positions in each data block that fall outside the precision requirement;
step C: for the data blocks processed in step B, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, they share one base and the duplicate bases are deleted; finally, calculating the storage space required for all bases and deviations;
step D: setting i = i + 1 and executing step C again to iterate; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the previously obtained bases and deviations as the final bases and deviations, wherein i denotes the number of data point positions with the minimum correlation in each data block;
step E: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists, recording its ID in the base dictionary; if not, retaining the base;
step F: for the bases retained in step E, traversing the time-series data to be compressed to obtain the number of times each retained base is used;
step G: assigning short ID codes to bases with more uses and long ID codes to bases with fewer uses;
step H: updating the base dictionary with the bases encoded in step G and their corresponding IDs;
step I: storing the IDs and the deviations.
The beneficial effects of the invention are as follows:
The method adapts to the user's compression precision requirement. The user sets the compression precision, and the base and part of the deviations of each data segment are stored so that the compressed data stay within that precision.
The database achieves a high compression ratio and saves storage space. Lossy compression lowers precision, keeping the bases while discarding part of the deviations, and thereby reduces storage space. Encoding with an idea similar to Huffman coding further improves the compression ratio.
The encoding scheme is flexible and can be changed. Huffman-coded data can be queried only after a whole segment is fully decompressed, so when query efficiency is low, a different encoding scheme can be selected to improve it.
Drawings
FIG. 1 is a flow chart of the present application;
FIG. 2 is a schematic illustration of the base and deviation of a data segment;
FIG. 3 is a schematic diagram of a time-series data partitioning process.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
Embodiment one: this embodiment is described with reference to FIG. 1. The adaptive lossy compression method for a time-series database according to this embodiment comprises the following steps:
step one: acquiring the time-series data to be compressed and dividing it into different data blocks, wherein the data in each data block do not repeat the data in any other data block;
step two: deleting, according to the precision requirement, the data point positions in each data block that fall outside the precision requirement;
step three: for the data blocks processed in step two, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, they share one base and the duplicate bases are deleted; finally, calculating the storage space required for all bases and deviations;
step four: setting i = i + 1 and executing step three again to iterate; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the previously obtained bases and deviations as the final bases and deviations, wherein i denotes the number of data point positions with the minimum correlation in each data block;
step five: storing the final bases and deviations.
The lossy compression process for time-series data resembles classic data deduplication. When a user uploads a file, the file is divided into a number of blocks; the blocks are not stored immediately, and instead a transformation function is applied to each block to obtain its base and deviation. According to the adaptive error requirement, low-order deviations are discarded as appropriate to obtain a storage scheme with a high compression ratio. The database applies deduplication to the bases so that each base is stored only once, and the required deviation is stored next to the ID of its base so that the original block can be recovered.
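The storage-driven iteration of steps three and four can be illustrated with a minimal Python sketch. It rests on assumptions the patent does not state: each block is a fixed-width unsigned integer, the minimum-correlation positions are the i low-order bits, and every base ID has the same fixed width. The names `storage_bits` and `choose_deviation_bits` are hypothetical.

```python
def storage_bits(blocks, width, i):
    """Estimate total bits when the i low-order bits of each block form the deviation."""
    bases = {b >> i for b in blocks}                  # shared bases are kept once
    id_bits = max(1, (len(bases) - 1).bit_length())   # assumed fixed-width base ID
    dictionary_bits = len(bases) * (width - i)        # each distinct base stored once
    per_block_bits = len(blocks) * (id_bits + i)      # base ID plus deviation per block
    return dictionary_bits + per_block_bits


def choose_deviation_bits(blocks, width):
    """Increase i until the required storage grows, then keep the previous result."""
    i = 1
    previous = storage_bits(blocks, width, i)
    while i + 1 < width:
        current = storage_bits(blocks, width, i + 1)
        if current > previous:   # storage got worse: stop and keep the previous split
            break
        i, previous = i + 1, current
    return i, previous


blocks = [0xA0, 0xA1, 0xA2, 0xA3, 0x50, 0x51, 0x52, 0x53]
print(choose_deviation_bits(blocks, 8))
```

For these eight 8-bit blocks the estimate is 52 bits at i = 1, 36 bits at i = 2 and 42 bits at i = 3, so the loop stops and keeps i = 2.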
(1) Data block partitioning
The user inputs the required compression precision. The large segment of data is then divided into different data blocks, the base (the common part) in each data block is determined and maximized by an iterative method (as shown in FIG. 3), and the deviation, i.e. the part other than the base, is retained according to the precision requirement. The base and deviation are shown in FIG. 2.
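As a rough illustration of retaining only part of the deviation, the Python sketch below splits an integer block into a base and a deviation and keeps only the high-order deviation bits. Treating the block as an integer, dropping low-order bits and restoring them as zeros are assumptions made for this sketch; `split_lossy` and `restore` are hypothetical names.

```python
def split_lossy(value, dev_bits, keep_bits):
    """Split a block into (base, kept deviation), dropping low-order deviation bits."""
    base = value >> dev_bits
    deviation = value & ((1 << dev_bits) - 1)
    dropped = dev_bits - keep_bits        # bits discarded under the precision setting
    return base, deviation >> dropped


def restore(base, kept, dev_bits, keep_bits):
    """Rebuild an approximation of the block; dropped bits come back as zeros."""
    dropped = dev_bits - keep_bits
    return (base << dev_bits) | (kept << dropped)


base, kept = split_lossy(0b10110110, dev_bits=4, keep_bits=2)
approx = restore(base, kept, dev_bits=4, keep_bits=2)
print(base, kept, approx)   # 11 1 180
```

Under these assumptions the reconstruction error is always smaller than 2 to the power of the number of dropped bits, so the user's precision requirement maps directly to `keep_bits`.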
(2) Coding method
All data are traversed to count the number of times each base is used. The frequency of each base gives an estimate of the distribution of the bases, which is used to define an entropy coding scheme for the base IDs, i.e. common bases are encoded with fewer bits. Similar in spirit to Huffman coding, this can further improve the overall compression, since the base IDs require less storage space. However, it affects the overall compression speed, so this step should be used only when the overall compression ratio matters more than query performance. In general, this step improves compression most when the usage distribution of the bases is far from uniform, since heavily used bases receive short IDs and rarely used bases receive long IDs.
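The frequency-based ID assignment can be sketched with textbook Huffman coding in Python. This is an illustration of the general idea only, not the patent's exact encoding; `huffman_ids` is a hypothetical name, and the input is assumed to be the sequence of bases in the order the blocks use them.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_ids(base_uses):
    """Map each base to a prefix-free bit string; frequent bases get shorter IDs."""
    freq = Counter(base_uses)
    if not freq:
        return {}
    if len(freq) == 1:                       # a single base still needs one bit
        return {next(iter(freq)): "0"}
    tie = count()                            # tie-breaker so dicts are never compared
    heap = [(n, next(tie), {base: ""}) for base, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)    # the two least-used subtrees
        n1, _, right = heapq.heappop(heap)
        merged = {b: "0" + code for b, code in left.items()}
        merged.update({b: "1" + code for b, code in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


uses = ["A"] * 8 + ["B"] * 4 + ["C"] * 2 + ["D"] * 2
ids = huffman_ids(uses)
print({base: len(code) for base, code in sorted(ids.items())})
```

With use counts 8, 4, 2 and 2, the ID lengths come out as 1, 2, 3 and 3 bits, for 28 bits in total against 32 bits with fixed 2-bit IDs.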
(3) Data compression
The time-series data are cut into blocks; if they cannot be divided into a whole number of blocks, zero padding is used to complete the last block. For each block, the base and deviation are identified first, and then the ID of that base is looked up. If the base does not yet exist, it is added to the base dictionary and assigned the next available integer. Once the ID is determined and the dictionary is updated, the next block is processed, until all the data are compressed.
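The block-by-block loop described above can be sketched in Python as follows. The sketch assumes byte-string input, big-endian integer blocks and a deviation made of the `dev_bits` low-order bits of each block; it keeps the full deviation, so this part is lossless. `compress` and `decompress` are hypothetical names.

```python
def compress(data: bytes, block_size: int, dev_bits: int):
    """Cut data into blocks, zero-pad the last one, and deduplicate the bases."""
    pad = (-len(data)) % block_size
    data += b"\x00" * pad                          # zero padding for the last block
    dictionary = {}                                # base -> integer ID
    records = []                                   # (base ID, deviation) per block
    mask = (1 << dev_bits) - 1
    for start in range(0, len(data), block_size):
        value = int.from_bytes(data[start:start + block_size], "big")
        base, deviation = value >> dev_bits, value & mask
        if base not in dictionary:
            dictionary[base] = len(dictionary)     # next available integer
        records.append((dictionary[base], deviation))
    return dictionary, records, pad


def decompress(dictionary, records, pad, block_size, dev_bits):
    """Rebuild the original bytes from base IDs and deviations."""
    bases = {base_id: base for base, base_id in dictionary.items()}
    out = b"".join(
        ((bases[i] << dev_bits) | d).to_bytes(block_size, "big") for i, d in records
    )
    return out[:len(out) - pad]                    # strip the zero padding


data = bytes([0xA0, 0xA1, 0xA2, 0x50, 0x51])
dictionary, records, pad = compress(data, block_size=1, dev_bits=2)
print(dictionary, records)
```

In this example the five one-byte blocks share only two bases, so the dictionary holds two entries and each block is reduced to a small ID plus a 2-bit deviation.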
The application compresses data adaptively according to the precision requirement and controls the degree of loss according to the user and the actual need, balancing user requirements against database storage. By storing the bases and deviations of the data, it compresses the data substantially, improving the compression ratio while preserving the query efficiency of the database. In addition, an idea similar to Huffman coding is used to secure the compression ratio and reduce storage space.
Embodiment two: this embodiment further describes embodiment one and differs from it in that step five specifically comprises:
step 5.1: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists, recording its ID in the base dictionary; if not, retaining the base;
step 5.2: for the bases retained in step 5.1, traversing the time-series data to be compressed to obtain the number of times each retained base is used;
step 5.3: sorting the retained bases in ascending order of use count, and encoding the sorted bases in order from long ID codes to short ID codes;
step 5.4: updating the base dictionary with the bases encoded in step 5.3 and their corresponding IDs;
step 5.5: storing the IDs and the deviations.
Embodiment three: this embodiment further describes embodiment one and differs from it in that dividing the time-series data to be compressed into different data blocks in step one specifically comprises:
dividing the time-series data to be compressed iteratively into non-repeating data blocks, with an error of 0.1%.
Embodiment four: this embodiment further describes embodiment one and differs from it in that step five specifically comprises:
entropy coding the final bases and deviations, and storing the entropy-coded data.
Embodiment five: this embodiment further describes embodiment four and differs from it in that the entropy coding is Shannon coding, Huffman coding or arithmetic coding.
Embodiment six: this embodiment further describes embodiment five and differs from it in that the entropy coding is Huffman coding.
Embodiment seven: this embodiment discloses an adaptive lossy compression system for a time-series database, comprising: a data acquisition module, a data cutting module, a data processing module and a storage module;
the data acquisition module is used for acquiring the time-series data to be compressed;
the data cutting module is used for cutting the time-series data to be compressed into different data blocks, wherein the data in each data block do not repeat the data in any other data block;
the data processing module is used for determining the base and deviation of each segmented data block, specifically as follows:
taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, they share one base and the duplicate bases are deleted; finally, calculating the storage space required for all bases and deviations;
setting i = i + 1 and iterating on the basis of the above step; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the previously obtained bases and deviations as the final bases and deviations, wherein i denotes the number of data point positions with the minimum correlation in each data block;
the storage module stores the determined bases and deviations.
Embodiment eight: this embodiment further describes embodiment seven and differs from it in that the storage module specifically executes the following steps:
step 1: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists, recording its ID in the base dictionary; if not, retaining the base;
step 2: for the bases retained in step 1, traversing the time-series data to be compressed to obtain the number of times each retained base is used;
step 3: assigning short ID codes to bases with more uses and long ID codes to bases with fewer uses;
step 4: updating the base dictionary with the bases encoded in step 3 and their corresponding IDs;
step 5: storing the IDs and the deviations.
Embodiment nine: this embodiment discloses a computer-readable storage medium storing an adaptive lossy compression program for a time-series database, wherein the program, when executed by a computer, performs the following steps:
step A: acquiring the time-series data to be compressed and dividing it into different data blocks, wherein the data in each data block do not repeat the data in any other data block;
step B: deleting, according to the precision requirement, the data point positions in each data block that fall outside the precision requirement;
step C: for the data blocks processed in step B, taking the data point position with the minimum correlation in each data block as the deviation and the remaining part as the base, and representing each data block by its base and deviation; if several data blocks contain the same base, they share one base and the duplicate bases are deleted; finally, calculating the storage space required for all bases and deviations;
step D: setting i = i + 1 and executing step C again to iterate; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the previously obtained bases and deviations as the final bases and deviations, wherein i denotes the number of data point positions with the minimum correlation in each data block;
step E: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists, recording its ID in the base dictionary; if not, retaining the base;
step F: for the bases retained in step E, traversing the time-series data to be compressed to obtain the number of times each retained base is used;
step G: assigning short ID codes to bases with more uses and long ID codes to bases with fewer uses;
step H: updating the base dictionary with the bases encoded in step G and their corresponding IDs;
step I: storing the IDs and the deviations.
Example:
(1) Configuration process of the adaptive lossy data compression algorithm:
Step one: according to the characteristics of the data or the user's own needs, the user specifies an expected data compression ratio within a reasonable range and the range of data to which compression applies, for example certain data within a certain period of time.
Step two: a decision on how many deviations to retain is made from the parameters supplied by the user; a certain number of deviations is selected and the expected effect is fed back to the user.
(2) Application process of the lossy data compression algorithm
Step one: divide the data into segments. How much data to concatenate depends on how strongly adjacent samples are correlated: when samples are concatenated, the base uses fewer bits per sample than when they are not, which reduces the overhead of representing the correlation. At the same time, concatenating too many samples lowers the probability of a match, since the segments must then agree in more bits, so there are likely to be fewer matches overall. An iterative method is therefore used to divide the data of the whole file into non-repeating data segments, ensuring that each data segment is unique (a corresponding error may exist).
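The trade-off in choosing the segment length can be made concrete with a small Python sketch that estimates the deduplicated storage cost for several candidate block sizes. The cost model (whole blocks deduplicated, fixed-width IDs, no deviations) is a simplifying assumption for illustration, and `dedup_cost` and `best_block_size` are hypothetical names.

```python
def dedup_cost(data: bytes, block_size: int) -> int:
    """Estimate bits needed to store data as unique blocks plus one ID per block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = set(blocks)
    id_bits = max(1, (len(unique) - 1).bit_length())   # assumed fixed-width IDs
    return len(unique) * block_size * 8 + len(blocks) * id_bits


def best_block_size(data: bytes, candidates):
    """Pick the candidate block size with the lowest estimated cost."""
    return min(candidates, key=lambda size: dedup_cost(data, size))


data = b"ABAB" * 8
for size in (1, 2, 4, 8):
    print(size, dedup_cost(data, size))
print(best_block_size(data, [1, 2, 4, 8]))
```

For this periodic input the estimates are 48, 32, 40 and 68 bits for block sizes 1, 2, 4 and 8: very short blocks pay for many IDs, very long blocks pay for large dictionary entries, and the size matching the period of the data is cheapest.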
Step two: determine the base and deviation. A transformation is applied to each data segment to obtain its base and deviation; if several blocks have the same base, the duplicate bases are deleted and only the deviations are kept. The number of deviations required is analyzed from the user's requirements, and when the precision range allows, part of the deviations is discarded and the rest is stored.
Step three: encode the data. The bases are traversed to find the number of times each is used and, using a method similar to Huffman coding, are encoded and stored according to these counts.
It should be noted that the detailed description serves only to explain and illustrate the technical solution of the invention and does not limit the scope of protection of the claims. Any modification or variation made in accordance with the claims and the description falls within the scope of protection of the invention.

Claims (7)

1. A time-series database adaptive lossy compression method, characterized by comprising the following steps:
step one: acquiring the time-series data to be compressed, and dividing the time-series data to be compressed into different data blocks, wherein the data in each data block does not repeat the data in any other data block;
step two: deleting, according to the precision requirement, the data point positions in each data block that lie beyond the precision requirement;
step three: for the data blocks processed in step two, taking the data point with the minimum correlation in each data block as a deviation and the remaining part as a base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally, calculating the storage space required by all the bases and deviations;
step four: setting i = i + 1 and executing step three again to iterate; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the bases and deviations obtained in the previous iteration as the final bases and deviations, wherein i represents the number of data point positions with the minimum correlation in each data block;
step five: storing the final bases and deviations;
the specific sub-steps of step five are:
step five-one: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording the ID of that base in the base dictionary; if no identical base exists in the base dictionary, retaining the base;
step five-two: for the bases retained in step five-one, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;
step five-three: sorting the retained bases in ascending order of number of uses, and then encoding the sorted bases with IDs ordered from long to short, so that the most frequently used bases receive the shortest IDs;
step five-four: updating the base dictionary with the bases encoded in step five-three and their corresponding IDs;
step five-five: storing the IDs and the deviations.
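As a rough illustration of steps three and four of claim 1, the following Python sketch searches for the split point i. It rests on several assumptions that the claim does not state: samples are already non-negative integers of WIDTH bits after step two, "data point positions" are read as bit positions, the i lowest bits are taken to be the least correlated ones, and storage is estimated with a simple bit count. The names split, cost_bits, choose_split and WIDTH are hypothetical.

```python
from typing import List, Tuple

WIDTH = 16  # assumed bit width of each quantized sample (not from the claim)

Block = List[int]

def split(blocks: List[Block], i: int) -> Tuple[List[tuple], List[List[int]]]:
    """Split every sample into a base (high bits) and a deviation (low i bits)."""
    mask = (1 << i) - 1
    bases = [tuple(v >> i for v in block) for block in blocks]
    devs = [[v & mask for v in block] for block in blocks]
    return bases, devs

def cost_bits(bases: List[tuple], i: int) -> int:
    """Assumed cost model: unique bases + all deviations + one base reference per block."""
    unique = set(bases)  # blocks with the same base share a single stored copy
    base_bits = sum(len(b) for b in unique) * (WIDTH - i)
    dev_bits = sum(len(b) for b in bases) * i
    ref_bits = len(bases) * max(1, (len(unique) - 1).bit_length())
    return base_bits + dev_bits + ref_bits

def choose_split(blocks: List[Block]):
    """Increase i until the estimated storage grows, then keep the previous result."""
    i = 1
    bases, devs = split(blocks, i)
    best = (i, bases, devs, cost_bits(bases, i))
    while i + 1 < WIDTH:
        i += 1
        bases, devs = split(blocks, i)
        cost = cost_bits(bases, i)
        if cost > best[3]:  # storage got larger: stop and keep the previous split
            break
        best = (i, bases, devs, cost)
    return best

if __name__ == "__main__":
    demo = [[4096, 4097], [4098, 4099], [4097, 4096], [4099, 4098]]
    i, bases, devs, cost = choose_split(demo)
    print(i, len(set(bases)), cost)
```

Under this cost model a larger i shortens the bases and makes them more likely to coincide across blocks, while every deviation grows by one bit, which is what gives the iteration a stopping point.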
2. The time-series database adaptive lossy compression method according to claim 1, wherein dividing the time-series data to be compressed into different data blocks in step one specifically comprises:
iteratively dividing the time-series data to be compressed into non-repeating data blocks, with an error of 0.1%.
3. The time-series database adaptive lossy compression method according to claim 1, wherein step five specifically comprises:
performing entropy coding on the final bases and deviations, and storing the entropy-coded data.
4. The time-series database adaptive lossy compression method according to claim 3, wherein the entropy coding is Shannon coding, Huffman coding or arithmetic coding.
5. The time-series database adaptive lossy compression method according to claim 4, wherein the entropy coding is Huffman coding.
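For readers unfamiliar with the entropy coding named in claims 3 to 5, the following is a minimal, generic Huffman code builder in Python. It is a textbook construction, not the patent's implementation; the function name huffman_code and the choice to represent codes as bit strings are assumptions made for illustration.

```python
import heapq
from collections import Counter
from typing import Dict, Hashable, Iterable

def huffman_code(symbols: Iterable[Hashable]) -> Dict[Hashable, str]:
    """Build a prefix-free code in which frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(n, idx, {s: ""}) for idx, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)  # two lightest subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

if __name__ == "__main__":
    deviations = [0, 0, 0, 0, 1, 1, 2, 3]  # illustrative deviation values
    code = huffman_code(deviations)
    encoded = "".join(code[d] for d in deviations)
    print(code, len(encoded))
```

The same function could be applied to base IDs as well as deviations, since it only needs hashable symbols.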
6. A time-series database adaptive lossy compression system, characterized by comprising: a data acquisition module, a data cutting module, a data processing module and a storage module;
the data acquisition module is used for acquiring the time-series data to be compressed;
the data cutting module is used for cutting the time-series data to be compressed into different data blocks, wherein the data in each data block does not repeat the data in any other data block;
the data processing module is used for determining the bases and deviations of the segmented data blocks, with the following specific steps:
taking the data point with the minimum correlation in each data block as a deviation and the remaining part as a base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally, calculating the storage space required by all the bases and deviations;
setting i = i + 1 and iterating on the basis of the above step; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the bases and deviations obtained in the previous iteration as the final bases and deviations, wherein i represents the number of data point positions with the minimum correlation in each data block;
the storage module stores the determined bases and deviations;
the storage module specifically executes the following steps:
step 1: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording the ID of that base in the base dictionary; if no identical base exists in the base dictionary, retaining the base;
step 2: for the bases retained in step 1, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;
step 3: assigning short ID codes to the bases with a large number of uses, and long ID codes to the bases with a small number of uses;
step 4: updating the base dictionary with the bases encoded in step 3 and their corresponding IDs;
step 5: storing the IDs and the deviations.
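The dictionary steps performed by the storage module can be sketched as follows in Python. This is an assumption-laden illustration, not the claimed implementation: IDs are taken to be integers serialized with a 7-bit variable-length encoding, so that smaller integers occupy fewer bytes; bases already in the dictionary keep their IDs; and new bases receive the next free integers in descending order of use count. The names varint and update_dictionary are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List

def varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups; small numbers take fewer bytes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def update_dictionary(dictionary: Dict[Hashable, int],
                      block_bases: List[Hashable]) -> List[int]:
    """Assign IDs to unseen bases (most used first) and return one ID per block."""
    # Steps 1 and 2: keep only bases missing from the dictionary and count their uses.
    counts = Counter(b for b in block_bases if b not in dictionary)
    # Steps 3 and 4: frequently used bases get the smaller, hence shorter, IDs.
    for base, _ in counts.most_common():
        dictionary[base] = len(dictionary)
    # Step 5 would store these IDs together with the deviations.
    return [dictionary[b] for b in block_bases]

if __name__ == "__main__":
    base_dict = {"A": 0}  # illustrative pre-existing dictionary
    ids = update_dictionary(base_dict, ["B", "C", "C", "A", "C", "B", "D"])
    print(ids, [varint(i) for i in ids])
```

A production system would also have to persist the dictionary and decide whether existing IDs may ever be reassigned; the claim does not address either point.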
7. A computer-readable storage medium, wherein the medium stores a time-series database adaptive lossy compression program, and when the time-series database adaptive lossy compression program is executed by a computer, the following steps are specifically executed:
step A: acquiring the time-series data to be compressed, and dividing the time-series data to be compressed into different data blocks, wherein the data in each data block does not repeat the data in any other data block;
step B: deleting, according to the precision requirement, the data point positions in each data block that lie beyond the precision requirement;
step C: for the data blocks processed in step B, taking the data point with the minimum correlation in each data block as a deviation and the remaining part as a base, and representing each data block by its base and deviation; if several data blocks contain the same base, those data blocks share one base and the repeated bases are deleted; finally, calculating the storage space required by all the bases and deviations;
step D: setting i = i + 1 and executing step C again to iterate; if the currently required storage space is larger than the previously required storage space, stopping the iteration and taking the bases and deviations obtained in the previous iteration as the final bases and deviations, wherein i represents the number of data point positions with the minimum correlation in each data block;
step E: for each final base obtained, searching the base dictionary for an identical base; if an identical base exists in the base dictionary, recording the ID of that base in the base dictionary; if no identical base exists in the base dictionary, retaining the base;
step F: for the bases retained in step E, traversing the time-series data to be compressed and obtaining the number of times each retained base is used;
step G: assigning short ID codes to the bases with a large number of uses, and long ID codes to the bases with a small number of uses;
step H: updating the base dictionary with the bases encoded in step G and their corresponding IDs;
step I: storing the IDs and the deviations.
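Step B of the claim above (discarding positions beyond the precision requirement) is the only lossy stage. One possible reading, offered here purely as an assumption, is decimal rounding to a fixed number of places, with samples stored as scaled integers. The Python sketch below shows that reading and checks the resulting error bound; the names quantize and dequantize and the parameter decimals are hypothetical.

```python
from typing import List

def quantize(values: List[float], decimals: int) -> List[int]:
    """Keep `decimals` decimal places and store each sample as a scaled integer."""
    scale = 10 ** decimals
    return [round(v * scale) for v in values]

def dequantize(ints: List[int], decimals: int) -> List[float]:
    """Recover approximate samples; the error is at most half of the last kept place."""
    scale = 10 ** decimals
    return [q / scale for q in ints]

if __name__ == "__main__":
    raw = [20.1234, 20.1299, -3.14159]  # illustrative sensor readings
    stored = quantize(raw, 2)
    restored = dequantize(stored, 2)
    worst = max(abs(a - b) for a, b in zip(raw, restored))
    print(stored, restored, worst)
```

The integers produced this way are the kind of input that a subsequent base and deviation split would operate on.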
CN202210318623.8A 2022-03-29 2022-03-29 Time sequence database self-adaptive lossy compression method, system and medium Active CN114665884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210318623.8A CN114665884B (en) 2022-03-29 2022-03-29 Time sequence database self-adaptive lossy compression method, system and medium

Publications (2)

Publication Number Publication Date
CN114665884A CN114665884A (en) 2022-06-24
CN114665884B true CN114665884B (en) 2022-11-25

Family

ID=82034096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210318623.8A Active CN114665884B (en) 2022-03-29 2022-03-29 Time sequence database self-adaptive lossy compression method, system and medium

Country Status (1)

Country Link
CN (1) CN114665884B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817831B (en) * 2022-06-30 2022-09-23 四川公路工程咨询监理有限公司 Computing auxiliary method for building engineering economy
CN115269526B (en) * 2022-09-19 2023-03-24 誉隆半导体设备(江苏)有限公司 Method and system for processing semiconductor production data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109687875A (en) * 2018-11-20 2019-04-26 成都四方伟业软件股份有限公司 A kind of time series data processing method
CN109716658A (en) * 2016-12-15 2019-05-03 华为技术有限公司 A kind of data de-duplication method and system based on similitude
CN114218287A (en) * 2021-12-30 2022-03-22 北京诺司时空科技有限公司 Query time prediction method for time sequence database

Also Published As

Publication number Publication date
CN114665884A (en) 2022-06-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant