WO2019128409A1

WO2019128409A1 - Data compression and storage method and data compression and storage device

Info

Publication number: WO2019128409A1
Application number: PCT/CN2018/111180
Authority: WO
Inventors: 何东杰
Original assignee: 中国银联股份有限公司
Priority date: 2017-12-28
Filing date: 2018-10-22
Publication date: 2019-07-04
Also published as: TWI683548B; TW201931780A; CN108304472A

Abstract

The present invention relates to a data compression and storage method and a data compression and storage device. The data compression method comprises the following steps: a segmentation step: segmenting original data into a plurality of fields; and a compression step: on the basis of different data contents, compressing different fields by using different compression policies, and storing the compressed data. According to the data compression and storage method and the data compression and storage device of the present invention, different compression methods can be used in consideration of different data contents, data compression efficiency can be effectively improved, and compared with common data compression tools such as GZIP and SNAPPY, data compression rate is significantly improved.

Description

Data compression storage method and data compression storage device

Technical field

The present invention relates to data processing technologies, and in particular, to a data compression storage method and a data compression storage device.

Background art

When storing data, enterprises generally compress it in order to save storage space and improve read efficiency.

Existing common data compression tools, such as GZIP and SNAPPY, are designed for general-purpose data.

However, because these tools treat all data alike, they do not fully take the characteristics of enterprise data into account. As a result, the compression efficiency they achieve on enterprise data is not very high.

Summary of the invention

In view of the above problems, the present invention aims to provide a data compression storage method and a data compression storage device that improve data compression efficiency.

The data compression storage method of the present invention is characterized in that it comprises the following steps:

a segmentation step of splitting the raw data into multiple fields; and

a compression step of compressing different fields using different compression strategies based on the different data contents, and storing the compressed data.

Preferably, in the compression step, the strength of the association between fields is determined, and the compression strategy is set according to the strength of the association.

Preferably, said compressing step comprises the following substeps:

performing content analysis on the data split into multiple fields to establish associations between the fields; and

for a single field, compressing and storing it using different compression strategies for different data contents.

Preferably, said compressing step comprises the following substeps:

performing content analysis on the data split into multiple fields, establishing a data distribution map and an association graph between the fields, and identifying correlations between the data fields based on the data distribution map and the association graph; and

combining the multiple fields between which a correlation exists, and compressing and storing the combined fields using different compression strategies for different data contents.

Preferably, said compressing step comprises the following substeps:

for a single field, compressing and storing it using different compression strategies for different data contents, and in addition combining the multiple fields between which a correlation exists and compressing and storing the combined fields using different compression strategies for different data contents.

Preferably, as a compression strategy, enumerated string fields are stored in binary form, and string values representing numbers are converted to integers or floating-point numbers for compressed storage.

Preferably, as a compression strategy, short fields are combined and then compressed and stored; for approximate fields, the redundant information is compressed before storage; for fields whose information is in reverse order of each other, only one of the fields is stored; and for fields having an inclusion relationship, the redundant information is compressed before storage.

Preferably, as a compression strategy, enumerated string fields are stored in binary form; string values representing numbers are converted to integers or floating-point numbers for compressed storage; short fields are combined and then compressed and stored; for approximate fields, the redundant information is compressed before storage; for fields whose information is in reverse order of each other, only one of the fields is stored; and for fields having an inclusion relationship, the redundant information is compressed before storage.

Preferably, further comprising:

a mapping relationship storing step of establishing a mapping relationship between the original data and the compressed data and storing the mapping relationship.

A data compression storage device according to the present invention is characterized by comprising:

a segmentation module for splitting raw data into multiple fields;

a compression module for compressing different fields using different compression strategies according to the different data contents, and storing the compressed data.

Preferably, the compression module determines the strength of the association between the fields, and sets the compression strategy according to the strength of the association.

Preferably, further comprising:

a mapping relationship storage module for establishing a mapping relationship between the original data and the compressed data and storing the mapping relationship.

Preferably, the compression module is provided with the following sub-modules:

a content analysis sub-module that performs content analysis on the data split into multiple fields and establishes associations between the fields; and

a compression sub-module that, for a single field, compresses and stores it using different compression strategies for different data contents.

Preferably, the compression module is provided with:

a content analysis sub-module that performs content analysis on the data split into multiple fields, establishes a data distribution map and an association graph between the fields, and identifies correlations between the data fields based on the data distribution map and the association graph; and

a compression sub-module that combines the multiple fields between which a correlation exists, and compresses and stores the combined fields using different compression strategies for different data contents.

Preferably, the compression module is provided with:

a compression sub-module that, for a single field, compresses and stores it using different compression strategies for different data contents, and that also combines the multiple fields between which a correlation exists and compresses and stores the combined fields using different compression strategies for different data contents.

The computer readable medium of the present invention has stored thereon a computer program, characterized in that the computer program is executed by a processor to implement the data compression storage method described above.

The computer device of the present invention comprises a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the data compression storage method described above.

As described above, the data compression storage method and the data compression storage device of the present invention provide an efficient data compression scheme tailored to the characteristics of enterprise data. By analyzing those characteristics, a correspondingly efficient compression algorithm is used for each data field, thereby improving the efficiency of data compression. Compared with general data compression tools such as GZIP and SNAPPY, the data compression rate is significantly improved.

Description of the drawings

Fig. 1 is a flow chart showing a data compression storage method of the present invention.

Fig. 2 is a block diagram showing the structure of a data compression storage device of the present invention.

Detailed description

The following describes some of the various embodiments of the invention and is intended to provide a basic understanding of the invention. It is not intended to identify key or critical elements of the invention or to limit the scope of the invention.

The main idea of the data compression storage method and the data compression storage device of the present invention is to analyze the content of enterprise data, establish a data distribution map, and, by combining the data content with the data distribution, adopt a correspondingly optimized compression algorithm for each kind of information. For example, an enumeration-type string field is compressed using binary codes, and a string that stores a number is converted to an integer or floating-point type. Furthermore, associated fields in the data content can be combined and then compressed. In addition, where some fields are derived from a combination of other fields, only one copy of the data needs to be stored, and where some fields are the reverse order of other fields, likewise only one copy of the data is stored.

Next, a data compression storage method of the present invention will be described.

As shown in FIG. 1, the data compression storage method of the present invention includes the following steps:

Segmenting step S100: cutting the original data into a plurality of fields;

Compression step S200: according to the different data contents, compressing different fields using different optimized compression strategies and storing the compressed data, wherein the strength of the association between fields is determined and the compression strategy is set according to the strength of the association.

Through the segmentation step S100 and the compression step S200, the relationships between the fields of the data table are established by analyzing the data content, and the corresponding optimized compression algorithm is then used for compression, so that the data compression rate can be increased.
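The two steps can be illustrated with a minimal Python sketch. It assumes that records are delimiter-separated text lines with the same number of fields; the function names `split_records` and `choose_strategy`, the `|` separator, the `enum_limit` threshold and the sample records are all hypothetical and are not taken from the patent.

```python
def split_records(lines, sep="|"):
    """Segmentation step (sketch): cut each raw record into fields, column-wise."""
    rows = [line.split(sep) for line in lines]
    # zip(*rows) assumes every record has the same number of fields.
    return [list(col) for col in zip(*rows)]


def choose_strategy(column, enum_limit=16):
    """Pick a hypothetical per-field strategy name from the field's content."""
    # Digit strings that survive an int round trip (no leading zeros) can be
    # stored as integers without losing information.
    if all(v.isascii() and v.isdigit() and str(int(v)) == v for v in column):
        return "integer"
    # Few distinct values: treat the field as an enumeration.
    if len(set(column)) <= enum_limit:
        return "enum"
    return "raw"


# Made-up sample records for illustration only.
raw = ["1001|CNY|SUCCESS", "1002|CNY|FAILED", "1003|USD|SUCCESS"]
columns = split_records(raw)
strategies = [choose_strategy(col) for col in columns]
```

In this sketch the strategy is chosen per field from its content alone; choosing a strategy from the association between fields would require the content analysis described in the embodiments.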

Preferably, a mapping relationship storing step S300 may further be provided, in which the mapping relationship between the original data and the compressed data is established in the metadata and stored, so that when the data is accessed from outside, the original data can be parsed out smoothly according to the mapping relationship.
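As a hedged illustration of such a mapping relationship, the following Python sketch keeps a small metadata record next to an encoded field and uses it to recover the original values. The metadata layout (a dict with `strategy` and `values` keys) and the function names are assumptions made for this example; the patent does not specify a metadata format.

```python
def encode_enum(column):
    """Replace each distinct string with a small integer code.

    Returns the codes and a metadata record describing the mapping.
    """
    values = sorted(set(column))
    code_of = {v: i for i, v in enumerate(values)}
    metadata = {"strategy": "enum", "values": values}  # hypothetical layout
    return [code_of[v] for v in column], metadata


def decode(codes, metadata):
    """Use the stored mapping to parse the original field values back out."""
    if metadata["strategy"] == "enum":
        return [metadata["values"][c] for c in codes]
    raise ValueError("unknown strategy: " + metadata["strategy"])


status = ["SUCCESS", "FAILED", "SUCCESS", "SUCCESS"]
codes, meta = encode_enum(status)
restored = decode(codes, meta)
```

An external reader only needs the metadata record to interpret the stored codes, which is the role the mapping relationship plays in the text above.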

Next, a brief description will be given of the data compression storage device of the present invention. Fig. 2 is a block diagram showing the structure of a data compression storage device of the present invention.

As shown in FIG. 2, the data compression storage device of the present invention comprises:

a segmentation module 100 for dividing the original data into a plurality of fields;

a compression module 200 for compressing different fields using different optimized compression strategies according to the different data contents and storing the compressed data, wherein the compression module 200 determines the strength of the association between fields and sets the compression strategy according to the strength of the association.

Preferably, a mapping relationship storage module 300 may further be provided, in which the mapping relationship between the original data and the compressed data is established and stored, so that when the data is accessed from outside, the original data can be parsed out according to the mapping relationship. The mapping relationship storage module 300 is not an essential structural unit of the data compression storage device of the present invention, but is a preferred module.
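How the compression module might set a strategy from the strength of an association can be sketched as follows. This is only an illustration: the patent does not define how strength is measured or which thresholds apply, so the 0-to-1 scale, the `strong` and `weak` thresholds and the strategy names below are all assumptions.

```python
def select_strategy(strength, strong=0.95, weak=0.5):
    """Map an association strength in [0, 1] to a hypothetical strategy name."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    if strength >= strong:
        # One field is (almost) fully determined by the other.
        return "store_one_field_and_derive_the_other"
    if strength >= weak:
        # The fields share information, so compress them together.
        return "combine_fields_then_compress"
    # Little shared information: handle each field on its own.
    return "compress_each_field_separately"


choices = [select_strategy(s) for s in (1.0, 0.7, 0.1)]
```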

Next, a specific embodiment of the data compression storage method and the data compression storage device of the present invention will be described.

First embodiment

The first embodiment relates to an embodiment in which compressed storage is performed for each field using an optimized compression strategy.

First, the data compression storage method of the first embodiment will be described.

The data compression storage method of the first embodiment includes a segmentation step and a compression step, wherein the segmentation step is the same as the segmentation step S100 described above, and the compression step specifically includes the following substeps:

As the compression strategy, different optimized data compression storage methods are used for different data contents; for example, enumeration types are stored in binary form, string values representing numbers are converted to integers or floating-point numbers for storage, and so on.

Preferably, a mapping relationship storing step may further be provided, in which a mapping relationship between the original data and the compressed data is established and stored, so that when the data is accessed from outside, the original data can be parsed out smoothly according to the mapping relationship.
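The two single-field strategies named above can be sketched in Python. This is one possible reading, not the patent's implementation: enumeration values are packed as fixed-width bit codes, and digit strings are stored as 8-byte signed integers with the standard `struct` module. The function names and sample values are hypothetical, and the integer conversion is refused when it would not round-trip (for example, leading zeros).

```python
import struct


def pack_enum(column):
    """Store an enumerated string field as fixed-width binary codes."""
    values = sorted(set(column))
    code_of = {v: i for i, v in enumerate(values)}
    bits = max(1, (len(values) - 1).bit_length())  # bits needed per code
    acc = 0
    for v in column:
        acc = (acc << bits) | code_of[v]
    nbytes = (bits * len(column) + 7) // 8
    return acc.to_bytes(nbytes, "big"), values, bits


def unpack_enum(data, values, bits, count):
    """Recover the original strings from the packed codes."""
    acc = int.from_bytes(data, "big")
    mask = (1 << bits) - 1
    codes = [(acc >> (bits * (count - 1 - i))) & mask for i in range(count)]
    return [values[c] for c in codes]


def pack_integers(column):
    """Store digit strings as little-endian 8-byte signed integers."""
    ints = [int(v) for v in column]
    if any(str(i) != v for i, v in zip(ints, column)):
        raise ValueError("field is not losslessly convertible to integers")
    return struct.pack(f"<{len(ints)}q", *ints)


def unpack_integers(data):
    """Recover the digit strings from the packed integers."""
    count = len(data) // 8
    return [str(i) for i in struct.unpack(f"<{count}q", data)]


currency = ["CNY", "CNY", "USD", "CNY", "EUR"]
packed_enum, enum_values, enum_bits = pack_enum(currency)

timestamps = ["1514390400000", "1514390460000"]
packed_ints = pack_integers(timestamps)
```

A fixed 8-byte width only pays off for long digit strings such as the 13-digit values used here; for short numbers a narrower or variable-length integer encoding would be the more sensible choice.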

Furthermore, the data compression storage device of the first embodiment will be briefly described.

The data compression storage device of the first embodiment includes a segmentation module and a compression module. The function of the segmentation module in the first embodiment is the same as that of the segmentation module 100 described above. The compression module in the first embodiment specifically includes the following sub-modules:

In addition, the mapping relationship storage module may be selectively set, and the mapping relationship between the original data and the compressed data is established and stored in the mapping relationship storage module.

Second embodiment

The second embodiment relates to an implementation that uses an optimized compression strategy for multiple fields.

First, the data compression storage method of the second embodiment will be described.

The data compression storage method of the second embodiment includes a segmentation step and a compression step, wherein the segmentation step is the same as the segmentation step S100 described above, and the compression step specifically includes the following substeps:

As the compression strategy, multiple fields are combined and a correspondingly optimized data compression storage method is adopted; for example, short fields are combined and then compressed and stored, the redundant information of approximate fields is compressed before storage, only one of the fields whose information is in reverse order of each other is stored, the redundant information of fields having an inclusion relationship is compressed before storage, and so on.
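Two of these multi-field relationships can be sketched in Python under stated assumptions. "Reverse order" is interpreted here as one field being the character-reversed form of another, and "inclusion relationship" as one field containing another as a substring; both interpretations, the function names and the sample values are assumptions made for illustration.

```python
def is_reverse(col_a, col_b):
    """True if every value in col_b is the reversed value of col_a."""
    return len(col_a) == len(col_b) and all(
        a == b[::-1] for a, b in zip(col_a, col_b)
    )


def strip_contained(container, contained):
    """For each row, drop the part of `container` that repeats `contained`.

    Stores (position, remainder); position -1 means no inclusion was found
    and the whole value is kept.
    """
    stored = []
    for big, small in zip(container, contained):
        pos = big.find(small)
        if pos < 0:
            stored.append((-1, big))
        else:
            stored.append((pos, big[:pos] + big[pos + len(small):]))
    return stored


def restore_contained(stored, contained):
    """Rebuild the container field from the remainder and the contained field."""
    result = []
    for (pos, rest), small in zip(stored, contained):
        result.append(rest if pos < 0 else rest[:pos] + small + rest[pos:])
    return result


# Made-up sample fields for illustration only.
dates = ["20180101", "20180102"]
order_ids = ["ORD-20180101-0001", "ORD-20180102-0002"]
stored = strip_contained(order_ids, dates)
rebuilt = restore_contained(stored, dates)
reverse_found = is_reverse(["abc", "12"], ["cba", "21"])
```

When `is_reverse` holds for a pair of fields, only one of them needs to be stored, since the other can be regenerated on read.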

Furthermore, the data compression storage device of the second embodiment will be briefly described.

The data compression storage device of the second embodiment includes a segmentation module and a compression module. The function of the segmentation module in the second embodiment is the same as that of the segmentation module 100 described above. The compression module in the second embodiment specifically includes the following sub-modules:

Third embodiment

The third embodiment relates to an implementation that uses an optimized compression strategy for both a single field and multiple field combinations.

First, a data compression storage method of the third embodiment will be described.

The data compression storage method of the third embodiment includes a segmentation step and a compression step, wherein the segmentation step is the same as the segmentation step S100 described above, and the compression step specifically includes the following substeps:

For a single field, different compression policies are used for different data content for compression storage, and multiple fields of related relationships are also combined. For the combined fields, different compression policies are used for different data contents for compression storage.

As the compression strategy, optimized data compression storage methods are adopted both for single fields and for combinations of multiple fields; for example, enumeration types are stored in binary form, string values representing numbers are converted to integers or floating-point numbers for storage, short fields are combined and then compressed and stored, the redundant information of approximate fields is compressed before storage, only one of the fields whose information is in reverse order of each other is stored, the redundant information of fields having an inclusion relationship is compressed before storage, and so on.
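Two further strategies from this list, approximate fields and combined short fields, can be sketched in Python. The sketch assumes that "approximate" means two fields sharing a long common prefix, and it substitutes the general-purpose `zlib` compressor for whatever compressor an implementation would actually use; the function names, the `\x1f` separator and the sample values are hypothetical.

```python
import os
import zlib


def diff_against(base, other):
    """Store `other` as (shared prefix length, differing tail) relative to `base`."""
    shared = len(os.path.commonprefix([base, other]))
    return shared, other[shared:]


def apply_diff(base, diff):
    """Rebuild the approximate field from the base field and the stored diff."""
    shared, tail = diff
    return base[:shared] + tail


def combine_short_fields(columns, sep="\x1f"):
    """Join several short fields per row, then compress all rows in one stream.

    Assumes field values contain neither `sep` nor a newline.
    """
    rows = [sep.join(values) for values in zip(*columns)]
    return zlib.compress("\n".join(rows).encode("utf-8"))


def split_short_fields(blob, sep="\x1f"):
    """Inverse of combine_short_fields: decompress and split back into columns."""
    rows = zlib.decompress(blob).decode("utf-8").split("\n")
    return [list(col) for col in zip(*(row.split(sep) for row in rows))]


# Made-up sample values for illustration only.
created = "2018-10-22 10:00:01"
updated = "2018-10-22 10:00:05"
diff = diff_against(created, updated)

short_columns = [["A", "B"], ["1", "2"], ["x", "y"]]
blob = combine_short_fields(short_columns)
```

Combining short fields before compression gives the compressor one longer stream to work on instead of many tiny ones, which is the usual motivation for this kind of grouping.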

Furthermore, the data compression storage device of the third embodiment will be briefly described.

The data compression storage device of the third embodiment includes a segmentation module and a compression module. The function of the segmentation module in the third embodiment is the same as that of the segmentation module 100 described above. The compression module in the third embodiment specifically includes the following sub-modules:

The content analysis sub-module performs content analysis on the data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation between the data fields based on the data distribution graph and the association relationship graph;
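The content analysis performed by this sub-module can be sketched in Python. The patent does not define the data distribution map or the association graph precisely, so this sketch makes assumptions: the distribution map is a per-field value histogram, and an edge from field A to field B is added when B is largely determined by A. The function names, the 0.9 threshold and the sample table are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def distribution_map(table):
    """Per-field value histogram; `table` maps field name to a list of values."""
    return {name: Counter(column) for name, column in table.items()}


def dependency_strength(col_a, col_b):
    """Fraction of rows whose b-value is the most common one for their a-value.

    1.0 means b is fully determined by a. Note that a unique-ID field
    trivially determines every other field under this measure.
    """
    groups = defaultdict(Counter)
    for a, b in zip(col_a, col_b):
        groups[a][b] += 1
    explained = sum(c.most_common(1)[0][1] for c in groups.values())
    return explained / len(col_a)


def association_graph(table, threshold=0.9):
    """Directed edges (a, b) -> strength for field pairs above the threshold."""
    edges = {}
    for a, b in permutations(table, 2):
        strength = dependency_strength(table[a], table[b])
        if strength >= threshold:
            edges[(a, b)] = strength
    return edges


# Made-up sample table for illustration only.
table = {
    "bank_code": ["0102", "0102", "0103", "0104"],
    "bank_name": ["BANK_A", "BANK_A", "BANK_B", "BANK_C"],
    "channel": ["POS", "ATM", "POS", "ATM"],
}
histograms = distribution_map(table)
edges = association_graph(table)
```

In this made-up table, `bank_code` and `bank_name` determine each other, so a compression sub-module could combine them or store only one of them together with a lookup table.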

The present invention also provides a computer readable medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements a data compression storage method in accordance with each of the above embodiments.

The present invention also provides a computer device comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor, when executing the computer program, implements the steps of the data compression storage method described above.

As described above, according to the data compression storage method and the data compression storage device of the present invention, different compression methods can be adopted in consideration of different data contents, which can effectively improve data compression efficiency. Compared with general data compression tools such as GZIP and SNAPPY, the data compression rate is significantly improved.

The above examples mainly illustrate the data compression storage method and the data compression storage device of the present invention. Although only some specific embodiments of the present invention have been described, it should be understood that the invention may be embodied in many other forms without departing from its spirit and scope. Accordingly, the examples and embodiments shown are to be regarded as illustrative and not restrictive, and the invention may cover various modifications and substitutions without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

A data compression storage method, comprising the steps of:

a segmentation step of splitting the raw data into multiple fields; and

a compression step of compressing different fields using different compression strategies based on the different data contents, and storing the compressed data.
A data compression storage method according to claim 1, wherein

in the compression step, the strength of the association between fields is determined, and the compression strategy is set according to the strength of the association.
A data compression storage method according to claim 1, wherein

The compressing step includes the following sub-steps:

performing content analysis on the data split into multiple fields to establish associations between the fields; and

for a single field, compressing and storing it using different compression strategies for different data contents.
The data compression storage method according to claim 1, wherein said compressing step comprises the following substeps:

performing content analysis on the data split into multiple fields, establishing a data distribution map and an association graph between the fields, and identifying correlations between the data fields based on the data distribution map and the association graph; and

combining the multiple fields between which a correlation exists, and compressing and storing the combined fields using different compression strategies for different data contents.
The data compression storage method according to claim 1, wherein said compressing step comprises the following substeps:

performing content analysis on the data split into multiple fields, establishing a data distribution map and an association graph between the fields, and identifying correlations between the data fields based on the data distribution map and the association graph; and

for a single field, compressing and storing it using different compression strategies for different data contents, and in addition combining the multiple fields between which a correlation exists and compressing and storing the combined fields using different compression strategies for different data contents.
A data compression storage method according to claim 3, wherein

as a compression strategy, enumerated string fields are stored in binary form, and string values representing numbers are converted to integers or floating-point numbers for compressed storage.
A data compression storage method according to claim 4, wherein

as a compression strategy, short fields are combined and then compressed and stored; for approximate fields, the redundant information is compressed before storage; for fields whose information is in reverse order of each other, only one of the fields is stored; and for fields having an inclusion relationship, the redundant information is compressed before storage.
A data compression storage method according to claim 5, wherein

as a compression strategy, enumerated string fields are stored in binary form; string values representing numbers are converted to integers or floating-point numbers for compressed storage; short fields are combined and then compressed and stored; for approximate fields, the redundant information is compressed before storage; for fields whose information is in reverse order of each other, only one of the fields is stored; and for fields having an inclusion relationship, the redundant information is compressed before storage.
The data compression storage method according to any one of claims 1 to 8, further comprising: a mapping relationship storing step of establishing a mapping relationship between the original data and the compressed data and storing the mapping relationship.
A data compression storage device, comprising:

a segmentation module for splitting raw data into multiple fields;

a compression module for compressing different fields using different compression strategies according to the different data contents, and storing the compressed data.
A data compression storage device according to claim 10, wherein

the compression module determines the strength of the association between the fields, and sets the compression strategy according to the strength of the association.
The data compression storage device according to claim 10, further comprising:

a mapping relationship storage module for establishing a mapping relationship between the original data and the compressed data and storing the mapping relationship.
The data compression storage device according to any one of claims 10 to 12, wherein the compression module comprises the following sub-modules:

a content analysis sub-module that performs content analysis on the data split into multiple fields and establishes associations between the fields; and

a compression sub-module that, for a single field, compresses and stores it using different compression strategies for different data contents.
The data compression storage device according to claim 1, wherein the compression module is provided with:

The content analysis sub-module performs content analysis on data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation relationship between the data fields based on the data distribution graph and the association relationship graph;

a compression sub-module that combines the multiple fields between which a correlation exists, and compresses and stores the combined fields using different compression strategies for different data contents.
The data compression storage device according to claim 1, wherein the compression module is provided with:

The content analysis sub-module performs content analysis on data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation relationship between the data fields based on the data distribution graph and the association relationship graph;

a compression sub-module that, for a single field, compresses and stores it using different compression strategies for different data contents, and that also combines the multiple fields between which a correlation exists and compresses and stores the combined fields using different compression strategies for different data contents.
A computer readable medium having stored thereon a computer program, wherein the computer program is executed by a processor to implement the data compression storage method according to any one of claims 1 to 9.
A computer device comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the data compression storage method according to any one of claims 1 to 9.