CN113258934A

CN113258934A - Data compression method, system and equipment

Info

Publication number: CN113258934A
Application number: CN202110703279.XA
Authority: CN
Inventors: 邹婷; 王楠; 段泽
Original assignee: Beijing Highlandr Digital Technology Co ltd
Current assignee: Beijing Highlandr Digital Technology Co ltd
Priority date: 2021-06-24
Filing date: 2021-06-24
Publication date: 2021-08-13

Abstract

The embodiment of the invention discloses a data compression method comprising the following steps: calculating a polynomial fitting function of a data set of a data file to be compressed; calculating, according to the polynomial fitting function, a fitting function calculation value corresponding to each numerical value in the data set; calculating the difference between each numerical value and the corresponding fitting function calculation value to obtain a difference set; and storing the data header of the compressed data file, the difference set and the polynomial fitting function to obtain the compressed data file. The invention also discloses a data compression system and equipment. The beneficial effects of the invention are as follows: the data compression method and system do not reduce data precision, improve the compression ratio and compression efficiency of the data, reduce data storage overhead, lower the bandwidth required for network transmission, and facilitate data transmission under poor network conditions.

Description

Data compression method, system and equipment

Technical Field

The present invention relates to the field of data compression technologies, and in particular, to a data compression method, system, and device.

Background

For data with a large volume, existing data compression methods suffer from an insufficient compression ratio, low compression efficiency, overly complex compression procedures and huge data models, which is not conducive to network transmission. For example, meteorological data has many elements, a large time span and a large overall data volume, and conventional data compression methods cannot satisfy the requirements for compression ratio, compression efficiency and accuracy at the same time.

Disclosure of Invention

In order to solve the above problems, the present invention provides a data compression method, system and device with high data precision, high compression ratio and high compression efficiency.

The invention provides a data compression method, which comprises the following steps:

calculating a polynomial fitting function of a data set of a data file to be compressed;

calculating a fitting function calculation value corresponding to each numerical value in the data set according to the polynomial fitting function;

calculating the difference value between each numerical value and the corresponding fitting function calculation value to obtain a difference value set;

and storing the data header of the compressed data file, the difference set and the polynomial fitting function to obtain the compressed data file.

As a further improvement of the invention, the data file to be compressed comprises a plurality of data sets, and a polynomial fitting function and a difference value set of each data set are respectively calculated.

As a further improvement of the invention, a polynomial fitting function of the data set of the data file to be compressed is calculated, and the polynomial fitting function of the data set is calculated by adopting a least square method.

As a further improvement of the invention, the data set of the data file to be compressed is a NetCDF data set, the NetCDF data set comprises a plurality of variables, the variables are N-dimensional arrays with time as an independent variable, and N is a positive integer.

As a further improvement of the present invention, the NetCDF data set is divided into a plurality of data subsets according to the difference of variables, each variable corresponds to one data subset, a polynomial fitting function and a difference set of each data subset are sequentially calculated, and the data headers, the polynomial fitting function and the difference set of the data subsets are sequentially stored to obtain a compressed data file.

As a further improvement of the invention, the obtained compressed data file is subjected to secondary compression, and the secondary compression adopts a zstd compression algorithm.

The present invention also provides a data compression system, the system comprising:

the data set acquisition module is used for reading the data files to be compressed to obtain M data sets of the data files to be compressed, wherein M is a positive integer;

a polynomial fitting module for calculating a polynomial fitting function of the M data sets, respectively, to obtain M polynomial fitting functions;

the difference value calculation module is used for calculating a fitting function calculation value corresponding to each numerical value in the data set according to the polynomial fitting function aiming at each data set, and calculating the difference value between each numerical value and the corresponding fitting function calculation value to obtain a difference value set of M data sets;

and the data compression module is used for storing the data headers of the M data sets of the compressed data file, the difference sets and the polynomial fitting functions to obtain the compressed data file.

As a further improvement of the invention, the polynomial fitting module respectively calculates the polynomial fitting functions of the M data sets by adopting a least square method.

As a further improvement of the present invention, the data set acquisition module divides the NetCDF data set into a plurality of data subsets according to the different variables, each variable corresponding to one data subset; the polynomial fitting module calculates the polynomial fitting function of each data subset in sequence; the difference value calculation module calculates, in sequence, the fitting function calculation value corresponding to each numerical value in each data subset, and calculates the difference between each numerical value and the corresponding fitting function calculation value to obtain a difference set; and the data compression module stores the data header, the difference set and the polynomial fitting function of each data subset in sequence to obtain a compressed data file.

As a further improvement of the invention, the system also comprises a secondary compression module, wherein the secondary compression module carries out secondary compression on the obtained compressed data file, and the secondary compression adopts a zstd compression algorithm.

The invention provides an electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and wherein the one or more computer instructions are executed by the processor to implement the data compression method.

The present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the above-mentioned data compression method.

The beneficial effects of the invention are as follows: a polynomial fitting function of a data set of a data file to be compressed is calculated, and the difference between each original numerical value in the data set and its fitting function calculation value is calculated; the data length of a difference is smaller than that of the original numerical value and requires less storage space, which achieves the purpose of data compression. The data compression method and system do not reduce data precision, improve the compression ratio and compression efficiency of the data, reduce data storage overhead, lower the bandwidth required for network transmission, and facilitate data transmission under poor network conditions.

Drawings

Fig. 1 is a schematic flow chart of a data compression method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a compression process of a data compression method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a decompression process of a data compression method according to an embodiment of the present invention;

Fig. 4 is a diagram illustrating a polynomial fit curve of a data compression method according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a data compression system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in Fig. 1, a data compression method according to an embodiment of the present invention includes: calculating a polynomial fitting function of a data set of a data file to be compressed; calculating, according to the polynomial fitting function, a fitting function calculation value corresponding to each numerical value in the data set; then calculating the difference between each numerical value and the corresponding fitting function calculation value to obtain a difference set; and storing the data header of the compressed data file, the difference set and the polynomial fitting function to obtain the compressed data file.

The data file to be compressed can have one or more data sets, and the same type of data can be classified into one data set, and each type of data has respective data attributes. In this embodiment, one process for implementing the method is as follows: after the data file to be compressed is obtained, a data header and data attributes of the data file to be compressed can be read into a memory, a polynomial fitting function and a difference set are sequentially solved for each data set, the polynomial fitting function and the difference set are stored in the memory, finally, the data header stored in the memory is simplified and stored in the data file, the obtained polynomial fitting function and the obtained difference set are sequentially stored in the data file, and the compressed data file is obtained.

By calculating a polynomial fitting function and then the difference between each original numerical value in the data set and its fitting function calculation value, and by keeping each difference to as few bytes as possible, only the difference set and the polynomial fitting function need to be stored in the end, which reduces the storage space. This method does not affect the precision of the original data (the data in the data file to be compressed).
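The steps above can be sketched in a few lines. This is only an illustrative sketch, not the implementation of the embodiment: it assumes that the values are first scaled to integers at a fixed decimal precision (the `scale` parameter), that the sample index is mapped to the interval [-1, 1], and that NumPy's `polyfit` performs the fit. The function names `compress_series`, `decompress_series` and `fitted_integers` are hypothetical.

```python
import numpy as np

def fitted_integers(coeffs, n):
    """Evaluate the polynomial at n normalized sample positions and round to integers."""
    x = np.linspace(-1.0, 1.0, n)   # assumed abscissa; keeps the fit well conditioned
    return np.rint(np.polyval(coeffs, x)).astype(np.int64)

def compress_series(values, degree=3, scale=100):
    """Return (coeffs, diffs): polynomial coefficients and integer differences."""
    # Quantize to integers so that the later reconstruction is exact (scale=100 keeps 2 decimals).
    q = np.rint(np.asarray(values, dtype=np.float64) * scale).astype(np.int64)
    x = np.linspace(-1.0, 1.0, q.size)
    coeffs = np.polyfit(x, q, degree)            # least-squares polynomial fit
    diffs = q - fitted_integers(coeffs, q.size)  # difference set
    return coeffs, diffs

def decompress_series(coeffs, diffs, scale=100):
    """Rebuild the quantized values as fitted value + stored difference."""
    return (fitted_integers(coeffs, diffs.size) + diffs) / scale

series = [20.15, 20.43, 20.91, 21.40, 21.77, 21.95, 21.88, 21.52]  # made-up sample values
coeffs, diffs = compress_series(series)
restored = decompress_series(coeffs, diffs)
```

Because the differences are formed against the rounded fitted values, the same rounding on the decompression side restores the quantized values exactly, provided the coefficients are stored without loss.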

According to an optional implementation, the data file to be compressed comprises a plurality of data sets. Each data set holds data of one type whose values are correlated with one another, and the types can be distinguished by their data attributes. After the data sets are obtained, the polynomial fitting function and the difference set of each data set are calculated separately, and finally the data header and the polynomial fitting function and difference set of each data set are stored in sequence to obtain the compressed data file. By exploiting the correlation within data of the same type, the polynomial fitting function and difference set of each data set are calculated class by class; because the differences are smaller than the original values and only the difference sets and polynomial fitting functions are finally stored, the storage space is reduced. In particular, for data with precision requirements, for example when two decimal places of precision must be guaranteed, the method provided by the embodiment of the invention can raise the data compression ratio to 15 to 20 times, which is higher than that of existing compression methods (about 5 times) while still meeting the data precision requirement.

In an alternative embodiment, the polynomial fitting function of the data set of the data file to be compressed is calculated by the least square method. The least square method is a mathematical optimization technique that finds the best-matching function for the data by minimizing the sum of squared errors. With it, the unknown data can be determined easily such that the sum of squared errors between the determined data and the actual data is minimized, i.e., the differences finally stored are smaller. One implementation process of calculating the polynomial fitting function by the least square method in this embodiment is as follows:

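Set the fitting polynomial:

$$y(x, W) = w_0 + w_1 x + w_2 x^2 + \cdots + w_n x^n = \sum_{k=0}^{n} w_k x^k$$

For the $m$ sample points $(x_i, y_i)$, $i = 1, 2, \dots, m$, the sum of squared errors is:

$$S = \sum_{i=1}^{m} \left( \sum_{k=0}^{n} w_k x_i^k - y_i \right)^2$$

Assume that the coefficients $w_j$ ($j = 0, 1, 2, \dots, n$) of the optimal function minimize the sum of squared errors $S$. For the optimal function, the partial derivative of $S$ with respect to each polynomial coefficient $w_j$ ($j = 0, 1, 2, \dots, n$) should then satisfy:

$$\frac{\partial S}{\partial w_j} = 2 \sum_{i=1}^{m} \left( \sum_{k=0}^{n} w_k x_i^k - y_i \right) x_i^j = 0$$

Taking $j = 0, 1, 2, \dots, n$ in turn gives:

$$\sum_{k=0}^{n} w_k \sum_{i=1}^{m} x_i^{j+k} = \sum_{i=1}^{m} x_i^j y_i, \qquad j = 0, 1, 2, \dots, n$$

Rewrite the sum of squared errors $S$ in matrix form. Let:

$$X_v = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^n \end{bmatrix}, \quad W = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}, \quad Y_r = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$

The sum of squared errors $S$ can then be written as:

$$S = (X_v W - Y_r)^T (X_v W - Y_r)$$

where $X_v$ is a Vandermonde matrix, $W$ is the coefficient vector of polynomial coefficients, and $Y_r$ is the output vector of the sample data set. The optimal function should satisfy:

$$\frac{\partial S}{\partial W} = 2 X_v^T X_v W - 2 X_v^T Y_r = 0$$

The polynomial coefficient vector $W$ of the optimal function is therefore:

$$W = (X_v^T X_v)^{-1} X_v^T Y_r$$

Obtaining the coefficient vector $[w_0, w_1, \dots, w_n]$ also gives the polynomial fitting function.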

The essence of the matrix method in the embodiment of the invention is that a Vandermonde matrix is constructed from the sample set, which converts the univariate N-degree polynomial nonlinear regression problem into a linear regression problem in N variables (namely, multiple linear regression).
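As a small illustration of the matrix method, the following sketch builds the Vandermonde matrix with NumPy and solves the resulting normal equations directly. The function name `polyfit_normal_equations` and the sample data are assumptions made for this example. Solving the normal equations directly is shown only for clarity, since they become ill-conditioned for high polynomial degrees.

```python
import numpy as np

def polyfit_normal_equations(x, y, degree):
    """Solve (V^T V) w = V^T y for the coefficients w = [w0, w1, ..., wn]."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Vandermonde matrix whose columns are 1, x, x^2, ..., x^n
    v = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(v.T @ v, v.T @ y)

# Synthetic samples of 1 - 2x + 0.5x^2; the fit should recover these coefficients.
x = np.linspace(-1.0, 1.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x**2
w = polyfit_normal_equations(x, y, degree=2)
```

Each power of x becomes one column of the matrix, so the polynomial fit reduces to an ordinary linear regression on those columns.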

For the solution of the linear regression problem, QR decomposition based on the Householder transform is used here. The specific derivation process is as follows.

The general form of least squares is:

$$\min_x F(x) = \min_x \| f(x) \|_2^2$$

where $f(x)$ is the residual function, representing the difference between the predicted value and the measured value, and $F(x)$ is the loss function.

1) When $f(x)$ is a linear equation, $f(x) = Ax - b$, the linear least squares problem is:

$$\min_x \| Ax - b \|_2^2$$

The expansion is as follows:

$$\| Ax - b \|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 b^T A x + b^T b$$

The derivative with respect to $x$ is:

$$2 A^T A x - 2 A^T b$$

The loss function reaches its minimum when the derivative is 0, so:

$$A^T A x = A^T b$$

The solution of the linear least squares problem is therefore:

$$x^* = (A^T A)^{-1} A^T b$$

Because a matrix inverse is required, QR decomposition can be adopted to reduce the difficulty of the calculation.

First, $A$ is QR decomposed, i.e. $A = QR$, where $Q$ has orthonormal columns ($Q^T Q = I$) and $R$ is an upper triangular matrix:

$$x^* = (R^T Q^T Q R)^{-1} R^T Q^T b = (R^T R)^{-1} R^T Q^T b = R^{-1} Q^T b$$

Because $R$ is an upper triangular matrix, its inversion is relatively easy, and the high complexity of directly inverting $A^T A$ is avoided.

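2) When $f(x)$ is a non-linear equation, the least squares problem is:

$$\min_x \| f(x) \|_2^2$$

Let the state vector be $x = (x_1, x_2, \dots, x_m)$. The first-order Taylor expansion of $f$ is:

$$f(x + \Delta x) \approx f(x) + J(x) \, \Delta x$$

where $J(x)$ is the Jacobian matrix, expressed as:

$$J(x) = \frac{\partial f}{\partial x}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}$$

At each step the increment $\Delta x_k$ minimizes $\| J(x_k) \, \Delta x + f(x_k) \|_2^2$, and the iteration

$$x_{k+1} = x_k + \Delta x_k$$

is repeated until convergence, which gives the optimal solution $x^*$.

QR decomposition may also be employed here to solve for $\Delta x_k$. Setting the QR decomposition

$$J(x_k) = QR$$

and proceeding by analogy with the linear least squares method (with $J(x_k)$ in place of $A$ and $-f(x_k)$ in place of $b$) gives:

$$\Delta x_k = -R^{-1} Q^T f(x_k)$$

The nonlinear least squares problem is therefore solved iteratively as:

$$x_{k+1} = x_k - R^{-1} Q^T f(x_k)$$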
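Both cases can be illustrated with NumPy. This is a sketch under stated assumptions, not the solver of the embodiment: `numpy.linalg.qr` delegates to LAPACK routines that build the factorization from Householder reflections, the helper names `lstsq_qr` and `gauss_newton_qr` are hypothetical, and the iteration shown is the scheme commonly called Gauss-Newton, applied here to made-up noise-free data.

```python
import numpy as np

def lstsq_qr(a, b):
    """Minimize ||a x - b||_2 through a = QR, then solve R x = Q^T b."""
    q, r = np.linalg.qr(a)               # reduced QR factorization
    return np.linalg.solve(r, q.T @ b)   # R is square and upper triangular here

def gauss_newton_qr(residual, jacobian, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x + dx, where dx minimizes ||J dx + f(x)||_2 via QR."""
    x = np.asarray(x0, dtype=np.float64)
    for _ in range(max_iter):
        dx = lstsq_qr(jacobian(x), -residual(x))
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Linear case: fit a straight line y = c0 + c1 * t.
t = np.linspace(0.0, 1.0, 20)
a = np.column_stack([np.ones_like(t), t])
line = lstsq_qr(a, 3.0 + 2.0 * t)

# Nonlinear case: fit y = p0 * exp(p1 * t) to synthetic data generated with p = (2, -1.5).
y = 2.0 * np.exp(-1.5 * t)

def residual(p):
    return p[0] * np.exp(p[1] * t) - y

def jacobian(p):
    e = np.exp(p[1] * t)
    return np.column_stack([e, p[0] * t * e])

params = gauss_newton_qr(residual, jacobian, x0=[1.0, -1.0])
```

A production solver would normally use a dedicated triangular solve for the system with R and add safeguards such as step damping, which are omitted here for brevity.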

Preferably, an over-fitting curve function is selected as the polynomial fitting function, so that the calculated differences are smaller and the finally obtained compressed data file occupies less storage space.
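The effect of the polynomial degree on the size of the differences can be checked with a short experiment. The sketch below is illustrative only: the synthetic series, the chosen degrees and the function name `residual_sum_of_squares` are assumptions, and the sum of squared differences is used as a simple measure of how small the difference set becomes.

```python
import numpy as np

def residual_sum_of_squares(values, degree):
    """Fit a polynomial of the given degree and return the sum of squared differences."""
    y = np.asarray(values, dtype=np.float64)
    x = np.linspace(-1.0, 1.0, y.size)   # normalized abscissa (assumption)
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    return float(np.sum((y - fitted) ** 2))

# Smooth synthetic stand-in for one time series.
s = np.linspace(0.0, 3.0, 40)
series = 1000.0 + 50.0 * np.sin(s) + 5.0 * np.cos(4.0 * s)
rss_by_degree = {d: residual_sum_of_squares(series, d) for d in (1, 3, 5, 7)}
```

A higher degree shrinks the differences but adds coefficients that must also be stored, so in practice the degree is a trade-off between the size of the difference set and the size of the stored polynomial.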

In an optional embodiment, the data set of the data file to be compressed is a NetCDF data set, where the NetCDF data set includes a plurality of variables, the variables are N-dimensional arrays with time as the independent variable, and N is a positive integer. The NetCDF data set is divided into a plurality of data subsets according to the different variables, each variable corresponding to one data subset; the polynomial fitting function and the difference set of each data subset are calculated in sequence, and the data header, the polynomial fitting function and the difference set of each data subset are stored in sequence to obtain a compressed data file.
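The sequential storage of the subsets could look like the following sketch. The binary layout is not specified in the text, so everything here is an assumption made for illustration: the record format (name length, coefficient count, difference count), the 16-bit width of the differences, and the function names `pack_subset` and `unpack_subsets`. Plain Python lists stand in for the values derived from each NetCDF variable.

```python
import struct
import numpy as np

HEADER = "<HHI"   # assumed record header: name length, coefficient count, difference count

def pack_subset(name, coeffs, diffs):
    """Serialize one data subset as: header | name | polynomial coefficients | differences."""
    name_bytes = name.encode("utf-8")
    coeffs = np.asarray(coeffs, dtype="<f8")
    diffs = np.asarray(diffs, dtype="<i2")   # assumes every difference fits in 16 bits
    header = struct.pack(HEADER, len(name_bytes), coeffs.size, diffs.size)
    return header + name_bytes + coeffs.tobytes() + diffs.tobytes()

def unpack_subsets(blob):
    """Read the subsets back in the order in which they were stored."""
    subsets, pos = {}, 0
    while pos < len(blob):
        name_len, n_coeffs, n_diffs = struct.unpack_from(HEADER, blob, pos)
        pos += struct.calcsize(HEADER)
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        coeffs = np.frombuffer(blob, dtype="<f8", count=n_coeffs, offset=pos)
        pos += coeffs.nbytes
        diffs = np.frombuffer(blob, dtype="<i2", count=n_diffs, offset=pos)
        pos += diffs.nbytes
        subsets[name] = (coeffs, diffs)
    return subsets

# Two made-up subsets, one per variable, stored one after the other.
blob = pack_subset("pressure", [0.5, -1.0, 103700.0], [80, 36, -32, -84])
blob += pack_subset("temperature", [0.1, 20.0], [3, -2, 0])
subsets = unpack_subsets(blob)
```

Storing the counts in each record header lets the reader walk the file subset by subset without a separate index.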

One implementation of an embodiment of the invention compresses meteorological data, for example. NetCDF is the most common storage format for meteorological data. Because meteorological data involves many elements and a large time span, its volume is generally large, which causes problems for data storage and network transmission. The existing compression techniques for NetCDF meteorological data mainly include the following:

1. rooka and the like establish a two-dimensional linear prediction statistical model of meteorological grid point data by analyzing the correlation between adjacent grid points of common meteorological elements and calculating the symbol entropy and the information redundancy of an element field, eliminate redundant information and provide a new method for lossless compression of data by combining Huffman coding.

2. The BP neural network is introduced on the basis of two-dimensional linear prediction, a secondary prediction model based on the neural network is established, redundant information of meteorological grid point data is effectively eliminated, and a novel efficient lossless compression scheme is provided by combining entropy coding.

3. In consideration of flood and the like, a new method is provided for changing the storage sequence of meteorological data through subtraction operation of adjacent time data, enabling the high 8 bits of the stored data to have 00 values or FF values as much as possible, and then compressing through a Win-RAR tool.

4. The method reduces redundant reading among meteorological lattice point data by a quadratic linear prediction method, solves the problem of large memory space occupied by the meteorological lattice point data during storage, is obtained by combining 500hPa height field analysis, and has the advantage that the variance of a prediction error sequence is one order of magnitude smaller than that of an original sequence, thereby showing that the correlation of the error sequence is greatly reduced, and proving that the lattice point data prediction compression can be completely realized.

Although the prior art can compress meteorological data while guaranteeing a certain data precision, it suffers from a low compression ratio, complex compression methods and huge data models, which is not conducive to network transmission. For example, in the prior art the highest compression ratio is about 10 times when one decimal place of precision is guaranteed, and about 5 times when two decimal places of precision are guaranteed. In addition, these methods only target data without special ocean grid point elements, and their compression ratio is uncertain because of terrain changes. The prior art therefore cannot be applied to common grid point meteorological data; its scope of application is not general enough and its limitations are large.

The method compresses NetCDF meteorological data as follows: a polynomial fitting function is calculated by exploiting the correlation between grid point data at adjacent times; the fitting function calculation value corresponding to each numerical value is calculated according to the polynomial fitting function; the difference between each originally stored value and its fitting function calculation value is then calculated to obtain a difference set, keeping each difference within 1 byte wherever possible; finally, only the difference set and the polynomial fitting function are stored, which reduces the storage space of the meteorological data. One implementation process for compressing NetCDF meteorological data according to the invention is as follows:

As shown in Fig. 2, the NetCDF meteorological data is read into memory. The data of the different meteorological elements (such as air temperature, air pressure, wind, humidity, cloud, precipitation and various weather phenomena) is then arranged, grid point by grid point, into time-ordered sequences, which yields a plurality of data sets; the data sets of the different elements are kept in memory, and the file header and the element attributes are also read into memory. Assume that the meteorological data is 4-dimensional with N elements (variables), and that the sizes of the time, longitude and latitude dimensions are T, I and J respectively. If each value is stored in 8 bytes, the total data size (pure data size) is T × I × J × N × 8 bytes. Because the data set varies regularly with time, longitude and latitude, the redundant storage of these three variables can be removed from each element and only the three values T, I and J need to be stored, so the total data volume can be directly reduced by T × I × J × 3 bytes.

After this processing, I × J × (N - 3) data sets are obtained. An optimal polynomial fit is then performed on each data set to calculate its polynomial fitting function. In this embodiment, the least square method is used for the polynomial fit. For example, a data set (original data set) capturing the change over time of the atmospheric pressure element at one grid point of a NetCDF meteorological data file is as follows: 103678.999, 103673.769, 103677.423, 103723.144, 103831.389, 103993.258, 104192.559, 104393.131, 104571.853, 104726.615, 104876.057, 105039.165, 105228.224, 105423.211, 105550.947, 105555.651, 105517.754, 105514.537. Performing a multi-order fit on this data set by the least square method gives the polynomial fitting function:

，

the resulting fitted curve is shown in fig. 4.

The set of differences between the original data and the values calculated by the polynomial fitting function is:

[80.0, 36.0, -32.0, -84.0, -97.0, -72.0, -21.0, 21.0, 39.0, 33.0, 24.0, 35.0, 83.0, 148.0, 160.0, 66.0, -52.0, -116.0, -122.0, -97.0, -106.0, -130.0, -132.0, -71.0, 23.0, 49.0, 70.0, 93.0, 163.0, 223.0, 151.0, 40.0, -73.0, -149.0, -159.0, -120.0, -57.0, 4.0, 80.0, 68.0, 0.0]

Comparing the original data set with the difference set shows that the original values are large, while the differences are far smaller. Compared with the pure data size of the original data set, T × I × J × 8 × N, the data size of the difference set is reduced to T × I × J × 1 × (N - 3). On the premise of guaranteeing two decimal places of precision, the data compression ratio is raised to 15 to 20 times; compared with other methods, this markedly improves the compression ratio and compression efficiency while meeting the high-precision requirement of the data.
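The comparison can be reproduced approximately with the pressure values listed in the text. The sketch below is illustrative only: the polynomial degree of 5, the normalized abscissa and the helper name `pure_data_ratio` are assumptions, so the differences it produces are not expected to match the difference set printed above. The ratio helper counts only value bytes and ignores the stored coefficients and headers.

```python
import numpy as np

# Atmospheric pressure values as listed in the text.
pressure = np.array([
    103678.999, 103673.769, 103677.423, 103723.144, 103831.389, 103993.258,
    104192.559, 104393.131, 104571.853, 104726.615, 104876.057, 105039.165,
    105228.224, 105423.211, 105550.947, 105555.651, 105517.754, 105514.537,
])

x = np.linspace(-1.0, 1.0, pressure.size)   # normalized time axis (assumption)
coeffs = np.polyfit(x, pressure, 5)         # degree 5 is an arbitrary choice
fitted = np.polyval(coeffs, x)
diffs = pressure - fitted                   # far smaller in magnitude than the raw values

def pure_data_ratio(n_elements, bytes_per_value=8, bytes_per_diff=1):
    """Ratio of T*I*J*bytes_per_value*N to T*I*J*bytes_per_diff*(N-3); T*I*J cancels."""
    return (bytes_per_value * n_elements) / (bytes_per_diff * (n_elements - 3))

ratio = pure_data_ratio(11)   # e.g. N = 11 elements, 8-byte values, 1-byte differences
```

The values are six-digit numbers, whereas the differences to the fitted curve need far fewer digits, which is what allows them to be stored in fewer bytes.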

The optimal fitting polynomial, i.e. the polynomial fitting function, is solved for each of the I × J × (N - 3) data sets in turn, the difference set between the fitting function calculation values and the original values is calculated, and the polynomial fitting function and the difference set are stored in memory. Finally, the processed data is encoded and stored: the header data of the data file to be compressed is first simplified and written to the file, and then the polynomial fitting functions and the difference sets are written to the file in sequence to obtain the compressed data file.

In an optional implementation, after the compressed data file is obtained, it is subjected to secondary compression using the zstd compression algorithm, which can further improve the data compression ratio by about 15%. The compressed data file is better suited to network transmission; particularly under poor network conditions (such as inland river and ship-to-shore communication), the data transmission pressure is lower.

For the data compression process of this embodiment, the corresponding data decompression process is: as shown in fig. 3, firstly, the zstd compression algorithm is called to decompress the data file; then, reading the simplified head data, the polynomial fitting function and the difference value set stored in sequence, and temporarily storing the difference value set in an internal memory; then finding out a corresponding polynomial fitting function and a corresponding difference set, and restoring an original data value by calculating the sum of a calculated value and a difference value of the polynomial fitting function; and finally, saving the restored data value to a file according to the format.
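The secondary compression stage and its reversal, which is the first step of decompression, can be sketched as follows. The sketch assumes the third-party `zstandard` Python package for zstd. So that the example still runs where that package is not installed, it falls back to `zlib` from the standard library, which is a stand-in and not zstd. The function names and the placeholder payload are hypothetical.

```python
import zlib

try:
    import zstandard   # third-party zstd binding, assumed to be installed
except ImportError:    # fallback only so that the sketch still runs
    zstandard = None

def secondary_compress(payload: bytes, level: int = 3) -> bytes:
    """Apply the second compression stage to an already packed data file."""
    if zstandard is not None:
        return zstandard.ZstdCompressor(level=level).compress(payload)
    return zlib.compress(payload, level)   # stand-in, NOT zstd

def secondary_decompress(blob: bytes) -> bytes:
    """Undo the second stage before the polynomials and differences are read."""
    if zstandard is not None:
        return zstandard.ZstdDecompressor().decompress(blob)
    return zlib.decompress(blob)

# Placeholder bytes standing in for header + polynomial fitting functions + difference sets.
packed = bytes([80, 36, 224, 172] * 256)
blob = secondary_compress(packed)
unpacked = secondary_decompress(blob)
```

After this step the decompressor reads the simplified header, the polynomial fitting functions and the difference sets, and restores each value as the fitted value plus the stored difference.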

As shown in Fig. 5, a data compression system according to an embodiment of the present invention includes a data set acquisition module, a polynomial fitting module, a difference value calculation module and a data compression module. The data set acquisition module is used for reading a data file to be compressed to obtain M data sets of the data file to be compressed, where M is a positive integer. The polynomial fitting module is used for calculating the polynomial fitting function of each of the M data sets to obtain M polynomial fitting functions. The difference value calculation module calculates, for each data set, the fitting function calculation value corresponding to each numerical value in the data set according to the polynomial fitting function, and calculates the difference between each numerical value and the corresponding fitting function calculation value to obtain the difference sets of the M data sets. The data compression module is used for storing the data headers, polynomial fitting functions and difference sets of the M data sets of the compressed data file to obtain the compressed data file.

The data compression system of the present embodiment can also compress meteorological data in NetCDF format. A polynomial fitting function is calculated by exploiting the correlation between grid point data at adjacent times in the meteorological data; the fitting function calculation value corresponding to each numerical value is calculated according to the polynomial fitting function; the difference between each originally stored value and its fitting function calculation value is calculated to obtain a difference set, keeping each difference within 1 byte wherever possible; finally, only the difference set and the polynomial fitting function are stored, which reduces the storage space of the meteorological data. The least square method can be used to calculate the polynomial fitting function. Preferably, an over-fitting curve function is selected as the polynomial fitting function, so that the calculated differences are smaller and the finally obtained compressed data file occupies less storage space.

In an optional implementation manner, the system implemented by the present invention further includes a secondary compression module, where the secondary compression module performs secondary compression on the obtained compressed data file, and the secondary compression employs a zstd compression algorithm. Through secondary compression, the compression rate of the data file can be improved by about 15 percent.

The invention also relates to an electronic device, such as a server or a terminal. The electronic device includes: at least one processor; a memory communicatively coupled to the at least one processor; and a communication component communicatively coupled to the storage medium, the communication component receiving and transmitting data under control of the processor; wherein the memory stores instructions executable by the at least one processor to implement the method of the above embodiments.

In an alternative embodiment, the memory is used as a non-volatile computer-readable storage medium for storing non-volatile software programs, non-volatile computer-executable programs, and modules. The processor executes various functional applications of the device and data processing, i.e., implements the method, by executing nonvolatile software programs, instructions, and modules stored in the memory.

The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

One or more modules are stored in the memory and, when executed by the one or more processors, perform the methods of any of the method embodiments described above.

The product can execute the method provided by the embodiments of the application and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, refer to the method provided by the embodiments of the application.

The present invention also relates to a computer-readable storage medium for storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.

That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Furthermore, those of ordinary skill in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

It will be understood by those skilled in the art that while the present invention has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims

1. A method of data compression, the method comprising:

2. The method of claim 1, wherein the data file to be compressed comprises a plurality of data sets, and the polynomial fitting function and the difference set are calculated separately for each data set.

3. The method of claim 1, wherein the polynomial fit function of the data set is calculated using a least squares method.

4. The method according to claim 1, wherein the dataset of the data file to be compressed is a NetCDF dataset, the NetCDF dataset comprises a plurality of variables, the variables are N-dimensional arrays with time as an argument, and N is a positive integer.

5. The method according to claim 4, wherein the NetCDF data set is divided into a plurality of data subsets according to different variables, each variable corresponds to one data subset, a polynomial fitting function and a difference set of each data subset are sequentially calculated, and data headers, the polynomial fitting function and the difference set of the data subsets are sequentially stored to obtain a compressed data file.

6. The method according to any one of claims 1 to 5, wherein the resulting compressed data file is subjected to a secondary compression using a zstd compression algorithm.

7. A data compression system, the system comprising:

8. The system of claim 7, further comprising a secondary compression module that secondarily compresses the resulting compressed data file, wherein the secondary compression employs a zstd compression algorithm.

9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any of claims 1-6.

10. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor for implementing the method according to any one of claims 1-6.