CN110798222B

CN110798222B - Data compression method and device

Info

Publication number: CN110798222B
Application number: CN201910925550.7A
Authority: CN
Inventors: 侯满
Original assignee: Beijing Inspur Data Technology Co Ltd
Current assignee: Beijing Inspur Data Technology Co Ltd
Priority date: 2019-09-27
Filing date: 2019-09-27
Publication date: 2022-04-22
Anticipated expiration: 2039-09-27
Also published as: CN110798222A

Abstract

The embodiments of the application provide a data compression method and device. The method comprises: analyzing the data of an index database and selecting target data from the data; generating a compression configuration file according to the target data; and compressing the target data according to the compression configuration file. Cold data or regular business data may be selected from the index database as the target data. A compression configuration file comprising time nodes and a compression algorithm is generated according to the characteristics of the target data. The method further comprises a decompression method, so that regular business data can be decompressed in advance, which improves the indexing efficiency of the cluster. Because the data are compressed and decompressed on a schedule using a compression algorithm with a high compression ratio, the availability of the index set is ensured, the disk space occupied by the data is reduced as far as possible, and the storage efficiency of the data is improved.

Description

Data compression method and device

Technical Field

The present application relates to the field of data processing, and in particular, to a data compression method and apparatus.

Background

At present, as computer technology develops, the demands placed on data search keep growing. SolrCloud is a distributed search scheme; solr is a high-performance full-text search server developed in Java and based on Lucene. It extends Lucene, provides a richer query language than Lucene, is configurable and extensible, optimizes query performance, and provides a complete management interface, which makes it an excellent full-text search engine. It supports enterprise-level data search and can handle large indexes and highly concurrent search requests.

In an existing solr cluster system, data can be stored in a database local to a server. Because the amount of stored data is large, solr can compress the local data with the LZ4 compression algorithm before storing it, which saves storage space. The LZ4 algorithm compresses and decompresses quickly, so index queries can be answered quickly.

However, the LZ4 algorithm achieves its high compression speed at the cost of a lower compression ratio; that is, the compressed data still occupies a large amount of storage space.

Disclosure of Invention

In view of this, embodiments of the present application provide a data compression method and apparatus, which are intended to perform secondary compression on part of data in a solr cluster, so as to further save a storage space.

In order to achieve the purpose, the invention provides the following technical scheme:

a method of data compression, the method comprising:

analyzing the data of the index database, and selecting target data from the data; wherein the target data comprises cold data and/or regular business data; the cold data is data whose use frequency is lower than a threshold value, and the processing of the regular business data follows a definite time pattern;

generating a compression configuration file according to the target data; wherein the compression configuration file comprises a compression algorithm having a high compression ratio;

and compressing the target data according to the compression configuration file.

Optionally, the compression configuration file includes a compression time node and compression setting parameters for the compression algorithm with a high compression ratio; the compression setting parameters comprise a target data storage location.

Optionally, the compressing the target data according to the compression configuration file includes:

triggering a compression starting instruction at the compression time node;

after the compression starting instruction is triggered, searching target data according to the target data storage position;

and compressing the target data according to the compression algorithm.

Optionally, the compression profile further comprises a decompression time node.

Optionally, the method further comprises:

triggering a decompression starting instruction at the decompression time node;

after the decompression starting instruction is triggered, searching target data according to the target data storage position;

and decompressing the target data according to the compression algorithm.

Optionally, the compression algorithm comprises gzip or lzo, which are compression algorithms with a high compression ratio.

Optionally, the method is applied to a solr storage cluster.

An apparatus for data compression, the apparatus comprising:

the data selection module is used for selecting the target data;

the compression control module is used for generating the compression configuration file and starting the compression module at the compression time node;

and the compression module is used for compressing the target data.

Optionally, the compression control module comprises:

the parameter generation module is used for generating the compression configuration file;

and the instruction triggering module is used for sending the compression starting instruction to the compression module at the compression time node.

Optionally, the apparatus further comprises:

the decompression control module is used for sending the decompression starting instruction to the decompression module at the decompression time node;

and the decompression module is used for decompressing the target data after receiving the decompression starting instruction.

The embodiments of the application provide a data compression method and device. Target data may be selected from an index database, and a compression configuration file is generated according to the characteristics of the target data. The target data may be cold data that is accessed infrequently or data with obvious business rules. The method further comprises a decompression method, so that regular business data can be decompressed in advance, which improves the indexing efficiency of the cluster. Because the data are compressed and decompressed on a schedule using a compression algorithm with a high compression ratio, the availability of the index set is ensured, the disk space occupied by the data is reduced as far as possible, and the storage efficiency of the data is improved.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a diagram of a data compression method according to an embodiment of the present application.

Fig. 2 is a flowchart of a data compression method according to an embodiment of the present application.

Fig. 3 is a flowchart of a data pre-decompression method according to an embodiment of the present application.

Fig. 4 is a flowchart illustrating data compression according to an embodiment of the present application.

Fig. 5 is a diagram of a data compression apparatus according to an embodiment of the present application.

Detailed Description

Data in a cluster can be divided into three types according to frequency of use: cold data, warm data, and hot data. Hot data is generally real-time data that is often accessed during indexing; its access frequency is high, that is, it frequently receives instructions such as searches and calls. For hot data, the interval between accesses can range from milliseconds to hours. Because hot data is used so frequently, the compression algorithm built into the solr cluster already saves some storage space while keeping queries fast, and secondary compression is not needed.

Cold data refers to data with a low access frequency. It may be data from long ago; common cold data includes bank certificates, tax certificates, medical files, film and television material, and the like. Cold data is typically offline data, and may be a disaster-recovery backup or data that laws and regulations require to be retained for a period of time. Because cold data is accessed infrequently (at intervals of more than one day), it can undergo secondary compression with a compression algorithm that has a high compression ratio, so that the storage space is used as efficiently as possible. And because it is accessed infrequently, the compression/decompression speed of the algorithm need not be considered.

Warm data lies between cold data and hot data: its access frequency is relatively low but higher than that of cold data.
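The three categories above can be sketched as a simple classifier over the mean interval between accesses. This is only an illustration: the function name and the 12-hour boundary for hot data are assumptions, since the description gives only "milliseconds to hours" for hot data and "more than one day" for cold data.

```python
from datetime import timedelta

# Assumed boundaries. The description only says hot data is accessed at
# intervals of milliseconds to hours and cold data at intervals over one day.
HOT_MAX_INTERVAL = timedelta(hours=12)  # hypothetical reading of "hours"
COLD_MIN_INTERVAL = timedelta(days=1)   # the ">1 day" figure for cold data


def classify_temperature(mean_access_interval: timedelta) -> str:
    """Classify data as 'hot', 'warm', or 'cold' by mean time between accesses."""
    if mean_access_interval > COLD_MIN_INTERVAL:
        return "cold"
    if mean_access_interval <= HOT_MAX_INTERVAL:
        return "hot"
    return "warm"
```

In practice the boundaries would be tuned to the service scenario rather than fixed as constants.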

In addition, cluster storage often contains data with obvious business rules. Such data is accessed at specific times to complete the corresponding task. For example, some configuration or network information stored in the cluster may need to be checked for configuration updates during a fixed time period. For such data, access requests are typically initiated only at fixed times. It can be compressed once the access is complete and decompressed when it is to be used.

In the prior art, a single compression algorithm is used for cold data, warm data, and hot data alike. That algorithm is fast but has a low compression ratio, so the data still occupies a large amount of disk space after compression, which reduces the efficiency of cluster storage.

In order to provide an implementation scheme for further improving cluster storage efficiency, embodiments of the present application provide a data compression method and apparatus, and a preferred embodiment of the present application is described below with reference to the accompanying drawings of the specification.

Fig. 1 is a diagram of a data compression method according to an embodiment of the present application, including:

101: Analyze the data of the index database and select target data from the data.

In this embodiment, before data compression, the data in the index database may be analyzed and target data may be selected from it. The index database is similar to the lookup table of a dictionary or a library catalog: all data in the cluster can be located directly through it. The index database is associated with the data and can store related information such as the specific location of the data, its recent access frequency and time, and its data type. The index database may store only this related information and not the data itself, which reduces storage space and improves query efficiency. Because the index database does not contain the data itself, analysis is efficient, and the server can quickly analyze the information stored in the index database and select the target data.

The process of selecting the target data in this embodiment may include analyzing the data information in the index database and selecting, as the target data, cold data with a low recent access frequency or data with an obvious business rule. Specifically, the server may analyze the data information in the index database at set times in combination with the current service scenario, and select as the target data, for example, daily incremental data that is updated and indexed at a fixed time every day, or data with a low usage frequency.

Analyzing the index database data and selecting the target data in this embodiment effectively picks out, from the large amount of data stored in the cluster, the cold data suitable for secondary compression and the data with a definite business rule. Performing secondary compression on this part of the data further improves the utilization of the cluster's storage space and reduces the space used. The selection step also screens out hot data, which does not need secondary compression: this data is accessed frequently and often receives retrieval instructions. If it were compressed a second time, it would have to be decompressed a second time on every access, which would consume server computing power and increase the response time of retrieval instructions, impairing the retrieval function of the cluster. Therefore, analyzing the index database data and selecting the target data ensures that subsequent operations do not affect the availability of the cluster or its query and retrieval efficiency.
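A minimal sketch of step 101, assuming the index database exposes per-item metadata as records. The `IndexEntry` fields, the 30-day window, and the threshold value are all hypothetical; the description only requires that cold data fall below some use-frequency threshold and that regular business data follow a definite time pattern.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexEntry:
    """Metadata held by the index database (no payload); fields are hypothetical."""
    data_id: str
    storage_location: str
    accesses_last_30_days: int
    # Hour of day at which the data is regularly processed, if it has such a pattern.
    regular_hour: Optional[int] = None


def select_targets(entries: List[IndexEntry],
                   cold_threshold: int = 30) -> List[IndexEntry]:
    """Select cold data (access count below threshold) or regular business data."""
    targets = []
    for entry in entries:
        is_regular = entry.regular_hour is not None
        is_cold = entry.accesses_last_30_days < cold_threshold
        if is_regular or is_cold:
            targets.append(entry)
    return targets
```

Hot data is left out of the result, so it keeps only the cluster's built-in compression and stays fast to query.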

102: Generate a compression configuration file according to the target data.

After analyzing the index database data, the server may further analyze the target data to generate a compression configuration file. The compression configuration file in this embodiment may include a compression time node and compression setting parameters. The compression time node is the start time of the compression task. When the server detects that the current time has reached the compression time node, it can issue a compression start instruction to start the data compression process.

The compression setting parameters may include the storage location of the target data and the compression algorithm employed. The storage location of the target data is its specific storage location in the cluster, that is, the data address obtained by analyzing the index database in step 101. The compression algorithm is the algorithm used for the subsequent compression. Since this embodiment performs secondary compression on cold data and regular business data, compression algorithms with a high compression ratio, such as gzip and lzo, may be used. Because the target data is cold data or regular business data and is called relatively rarely, compression/decompression speed can be sacrificed to some extent in order to use the storage space as efficiently as possible. Further, for cold data, a compression algorithm with a higher compression ratio may be employed. For regular business data, a compression algorithm with a relatively fast compression speed may be used, since the data needs to be decompressed before the business is executed.
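One possible shape for such a configuration file is sketched below. The field names, the JSON serialization, the `kind` labels, and the use of gzip levels 9 and 1 to stand in for "higher ratio" and "faster" are assumptions made for illustration; the description does not fix a file format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CompressionProfile:
    storage_location: str                     # where the target data is stored
    algorithm: str                            # compression algorithm name
    level: int                                # algorithm-specific effort level
    compression_time: str                     # compression time node, "HH:MM"
    decompression_time: Optional[str] = None  # only for regular business data


def build_profile(storage_location: str, kind: str, compression_time: str,
                  decompression_time: Optional[str] = None) -> CompressionProfile:
    """Build a profile; `kind` is 'cold' or 'regular' (hypothetical labels)."""
    if kind not in ("cold", "regular"):
        raise ValueError(f"unknown data kind: {kind}")
    # Assumed mapping: cold data favors ratio, regular data favors speed.
    level = 9 if kind == "cold" else 1
    return CompressionProfile(storage_location, "gzip", level,
                              compression_time, decompression_time)


def to_json(profile: CompressionProfile) -> str:
    """Serialize the profile so it can be stored as a configuration file."""
    return json.dumps(asdict(profile), indent=2)
```

A separate profile per piece of target data keeps the time nodes and algorithm choice independent for each item.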

103: Compress the target data according to the compression configuration file.

After obtaining the compression configuration file corresponding to the target data, the target data may be compressed.

The data compression method provided by this embodiment can select the target data from the index database and generate a compression configuration file according to the characteristics of the target data. The target data may be cold data that is accessed infrequently or data with obvious business rules. Because the data are compressed and decompressed on a schedule using a compression algorithm with a high compression ratio, the availability of the index set is ensured, the disk space occupied by the data is reduced as far as possible, and the storage efficiency of the data is improved.

To further illustrate the process of compressing the target data in the present application, fig. 2 provides a flowchart of a data compression method in the present application, which includes:

201: Trigger a compression start instruction at the compression time node.

In this embodiment, when the system detects that the time has reached the compression time node, a compression start instruction may be triggered. The compression time node can be set for regular business data. For example, if a piece of regular business data is updated between 3 and 4 am every day, its compression time node may be any time after the update finishes at 4 am, for example 4:05. The compression task is started at that compression time node. For cold data, a periodic mechanism can generally be used: the data in the index database is analyzed at set intervals, and secondary compression is performed on cold data found to be not yet compressed.

In this embodiment, the compression start instruction is triggered at the designated compression time node, so that the compression task runs periodically. This ensures that regular business data and cold data are kept in a secondarily compressed state whenever they are not in use, which saves storage space.
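As an illustration of how a daily compression time node such as 4:05 could be turned into a concrete trigger time, the following sketch computes the next occurrence of the node. The function name is hypothetical, and a daily repetition is assumed from the "every day" example above.

```python
from datetime import datetime, time, timedelta


def next_trigger(now: datetime, node: time) -> datetime:
    """Return the first daily occurrence of `node` that is not before `now`."""
    candidate = datetime.combine(now.date(), node)
    if candidate < now:
        # Today's node has already passed, so schedule it for tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

A scheduler could sleep until the returned time and then issue the compression start instruction.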

202: After the compression start instruction is triggered, search for the target data according to the target data storage location.

After the start instruction is issued, the server may search for the target data according to the target data storage location. In this embodiment, analyzing the index database yields the storage location of the target data rather than the target data itself, so when the compression task is executed, the corresponding target data can be found in the cluster by means of the target data storage location included in the compression setting parameters.

203: Compress the target data according to the compression algorithm.

After the actual storage location of the target data is found, the target data may be compressed by using a compression algorithm included in the compression setting parameter in this embodiment. The compressed data may be stored in the location where the original target data is located, or may be stored in other locations, which is not specifically limited in this embodiment.
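Steps 202 and 203 can be sketched with Python's standard `gzip` module, assuming for illustration that the target data storage location is a path to a single file. The function name and the `.gz` naming convention are assumptions; the original file is left in place, since the description allows the compressed data to be stored either at the original location or elsewhere.

```python
import gzip
import shutil
from pathlib import Path


def compress_target(storage_location: str, level: int = 9) -> Path:
    """Compress the file at `storage_location` into a sibling `.gz` file."""
    src = Path(storage_location)
    dst = src.with_name(src.name + ".gz")
    # Stream the data so large files are not loaded into memory at once.
    with src.open("rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```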

The two embodiments above describe the data compression method of the present application. Cold data may be left without further processing after compression and decompressed only when needed. Although decompressing on demand is slow, cold data is accessed infrequently, so the impact on overall cluster performance is small. For regular business data, however, decompressing on demand at the time of use would greatly affect the overall speed of the cluster, and therefore the present application also provides a data pre-decompression method.

Fig. 3 is a flowchart of a data pre-decompression method according to an embodiment of the present application, including:

301: Trigger a decompression start instruction at the decompression time node.

First, in order to implement pre-decompression of the target data, a decompression time node and a decompression start instruction may be included when the compression configuration file is set. The decompression time node is the time at which the decompression start instruction is triggered.

When the time reaches the decompression time node, the decompression start instruction is triggered. The decompression time node can be set for regular business data. For example, if a piece of regular business data is updated between 3 and 4 am every day and decompressing the target data takes 5 minutes, the decompression time node may be any time more than 5 minutes before the update begins at 3 am. For example, the decompression task can be triggered at 2:53 to decompress the target data, which ensures that the data can be processed normally at 3:00 without affecting the normal service operation of the cluster.
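The arithmetic in this example can be written out as a small helper. The 2-minute safety margin is an assumption chosen so that the result matches the 2:53 figure in the text (3:00 minus 5 minutes of decompression minus 2 minutes of slack); the function and parameter names are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def decompression_time_node(business_start: time,
                            decompress_duration: timedelta,
                            safety_margin: timedelta = timedelta(minutes=2)) -> time:
    """Return the time of day at which pre-decompression should be triggered."""
    # Any date works as an anchor; only the time of day is returned,
    # so a start shortly after midnight wraps to the previous evening.
    anchor = datetime.combine(date(2000, 1, 2), business_start)
    return (anchor - decompress_duration - safety_margin).time()
```

For a business start of 3:00 and a 5-minute decompression, the helper returns 2:53 under the assumed margin.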

302: After the decompression start instruction is triggered, search for the target data according to the target data storage location.

After the start instruction is issued, the server may search for the target data according to the target data storage location. In this embodiment, analyzing the index database yields the storage location of the target data rather than the target data itself, so when the decompression task is executed, the corresponding target data can be found in the cluster by means of the target data storage location included in the compression setting parameters.

303: Decompress the target data according to the compression algorithm.

After the actual storage location of the target data is found, the compression algorithm included in the compression setting parameter may be used to decompress the target data in this embodiment. The decompressed data may be stored in the location where the original target data is located, or may be stored in other locations, which is not specifically limited in this embodiment.
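Steps 302 and 303 can be sketched in the same spirit, assuming the compressed target is a single gzip file whose name ends in `.gz`. The function name and the naming convention are assumptions made for this example.

```python
import gzip
import shutil
from pathlib import Path


def decompress_target(compressed_location: str) -> Path:
    """Decompress a `.gz` file next to itself and return the restored path."""
    src = Path(compressed_location)
    if src.suffix != ".gz":
        raise ValueError(f"not a gzip target: {src}")
    dst = src.with_suffix("")  # drop the trailing ".gz"
    with gzip.open(src, "rb") as fin, dst.open("wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```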

To further explain the data compression and decompression processes in the present application as a whole, fig. 4 shows a data compression execution flowchart provided in an embodiment of the present application, which mainly includes:

401: the task starts.

After applying the method in a cluster, the server may send an instruction to start setting tasks and parameters.

In this embodiment, the setting task and the parameter may include analyzing all data in the index library, and selecting cold data and regular business data for compression. After the analysis is finished, a large amount of target data can be obtained, and a separate compression configuration file can be generated for each target data. Different compression algorithms and time nodes are selected for different data types and access frequencies, etc.

402: the compression control is started.

After obtaining the compression configuration file, the server may store it in the form of a compression task and its parameters. The compression task may be a process in the server, which executes according to the compression configuration file when it detects that the time has reached the compression time node or the decompression time node.

403: Compression response.

In this embodiment, the data compression or data decompression task may be dispatched by a background process of the server. When the background process detects that the current time is a time node, it triggers a compression or decompression operation instruction.

The step of triggering the instruction in this embodiment may further include searching the cluster for the corresponding target data, creating and executing the corresponding data compression process or data decompression process, creating a process monitor to monitor the compression or decompression process, and the like. These are all operation steps commonly used in the computer field and are not described further in this embodiment.
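The background check described in step 403 can be illustrated as a pure function that, given the current time and the stored profiles, lists which operations are due. The dictionary keys and the minute-level "HH:MM" matching are assumptions made for this sketch.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def due_actions(now: datetime,
                profiles: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (action, storage_location) pairs whose time node equals `now`."""
    current = now.strftime("%H:%M")
    actions = []
    for profile in profiles:
        if profile.get("compression_time") == current:
            actions.append(("compress", profile["storage_location"]))
        if profile.get("decompression_time") == current:
            actions.append(("decompress", profile["storage_location"]))
    return actions
```

A background process could call this once per minute and start a compression or decompression process for each returned pair.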

404: Compression control.

In the process of compression and decompression, the server can also monitor and fine-tune the compression and decompression according to the service scene.

For example, if the server is compressing a piece of cold data and receives a query or call instruction from the user for that cold data, the server can directly terminate the compression process and send the original data to the user, so as to ensure index efficiency.
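One way to make a compression task interruptible is to compress in chunks and check a stop flag between chunks. The sketch below uses `threading.Event` as that flag; the function name, the in-memory data, and the chunk size are assumptions, and the description itself does not specify how the process is terminated.

```python
import gzip
import io
import threading
from typing import Optional


def compress_unless_interrupted(data: bytes, stop: threading.Event,
                                chunk_size: int = 64 * 1024) -> Optional[bytes]:
    """Gzip `data` chunk by chunk; return None if `stop` is set part-way."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for offset in range(0, len(data), chunk_size):
            if stop.is_set():
                # A query arrived: abandon the partial output so the
                # caller can serve the untouched original data.
                return None
            gz.write(data[offset:offset + chunk_size])
    return buf.getvalue()
```

The query handler would call `stop.set()` and then read the original data, which this sketch never modifies.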

405: Task completion.

After the compression of the data is completed, the server may process the original data before the compression, and may delete or backup the original data to another storage cluster. This embodiment is not further limited.

This embodiment describes the data compression method provided by the present application in detail by way of a flow; its key point is the "compression control start" step. The data and service scenarios can be analyzed, cold data with a low use frequency and regular business data with a regular use time are selected as target data, corresponding compression time nodes and decompression time nodes are set, and the corresponding operations are completed at those time nodes. In this way, storage disk space is saved as far as possible while cluster index efficiency is preserved as far as possible.

Fig. 5 is a diagram of a data compression apparatus according to an embodiment of the present invention, including:

501: a data selection module:

for selecting the target data.

502: a compression control module:

for generating the compression configuration file and starting the compression module at the compression time node.

503: a compression module:

for compressing the target data.

Each module in this embodiment may have a signal transceiving module or interface for signal transmission between the modules. For example, the data selection module may send the target data storage location and the compression time node to the compression control module through the interface, and the compression control module may select a corresponding compression algorithm according to the received target data storage location and the compression time node. And sending a data compression instruction to a compression module at the compression time node, and informing the compression module to compress the target data on the storage position by adopting a compression algorithm contained in the compression configuration file.

The modules in this embodiment may be located in the same server or in different servers; signals may be transmitted between the modules through an interface, over a network, or by other means, which is not specifically limited in this embodiment.
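The interaction between the three modules can be sketched with plain method calls standing in for the interfaces between them. The class and method names, the in-memory `store` dictionary, and the "HH:MM" time-node strings are all hypothetical; real modules could equally run on different servers and communicate over a network.

```python
import gzip
from typing import Dict, List


class CompressionModule:
    """Compresses the target data found at a storage location."""

    def __init__(self, store: Dict[str, bytes]):
        self.store = store  # stand-in for cluster storage

    def compress(self, location: str) -> None:
        self.store[location] = gzip.compress(self.store[location])


class CompressionControlModule:
    """Holds compression profiles and starts compression at the time node."""

    def __init__(self, compressor: CompressionModule):
        self.compressor = compressor
        self.profiles: List[Dict[str, str]] = []

    def register(self, location: str, compression_time: str) -> None:
        self.profiles.append({"storage_location": location,
                              "compression_time": compression_time})

    def tick(self, current_time: str) -> None:
        # Send a compression instruction for every profile whose node is reached.
        for profile in self.profiles:
            if profile["compression_time"] == current_time:
                self.compressor.compress(profile["storage_location"])


class DataSelectionModule:
    """Passes the selected target's location and time node to the controller."""

    def __init__(self, controller: CompressionControlModule):
        self.controller = controller

    def submit(self, location: str, compression_time: str) -> None:
        self.controller.register(location, compression_time)
```

A caller would invoke `tick` once per time node; this sketch does not track whether a location has already been compressed.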

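In one embodiment, the compression control module includes:

the parameter generation module is used for generating the compression configuration file;

and the instruction triggering module is used for sending the compression starting instruction to the compression module at the compression time node.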

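In one embodiment, the apparatus further comprises:

the decompression control module is used for sending the decompression starting instruction to the decompression module at the decompression time node;

and the decompression module is used for decompressing the target data after receiving the decompression starting instruction.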

For regular traffic data, the server may need to decompress it at a certain preset time point to complete the corresponding task. The operation of the part is similar to that of the compression control module, and the compression instruction is replaced by the decompression instruction.

As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant points can be found in the description of the method embodiment. The above-described device and system embodiments are merely illustrative; the modules described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims

1. A method of data compression, the method comprising:

analyzing the data of the index database, and selecting target data from the data; the target data comprises regular business data, and the processing of the regular business data has a definite time rule;

generating a compression configuration file according to the target data; the compression configuration file comprises a compression algorithm, a compression time node and a decompression time node, and the compression time node and the decompression time node are set aiming at the time law of the regular service data;

2. The method of claim 1, wherein the compression profile further comprises compression setting parameters for setting the compression algorithm, the compression setting parameters comprising a target data storage location.

3. The method of claim 2, wherein compressing the target data according to a compression profile comprises:

triggering a compression starting instruction at the compression time node;

and compressing the target data according to the compression algorithm.

4. The method of claim 1, further comprising:

triggering a decompression starting instruction at the decompression time node;

and decompressing the target data according to the compression algorithm.

5. The method of claim 1, wherein the compression algorithm comprises a gzip or lzo compression algorithm having a high compression ratio.

6. The method of claim 1, applied to a solr storage cluster.

7. An apparatus for compressing data, the apparatus comprising:

the data selection module is used for analyzing the data of the index database and selecting target data from the data, wherein the target data comprises regular business data, and the regular business data has a definite time rule in processing;

the compression control module is used for generating a compression configuration file according to the target data; the compression configuration file comprises a compression algorithm, a compression time node and a decompression time node, and the compression time node and the decompression time node are set for the regular service data;

and the compression module is used for compressing the target data according to the compression configuration file.

8. The apparatus of claim 7, wherein the compression control module comprises:

and the instruction triggering module is used for sending a compression starting instruction to the compression module at the compression time node.

9. The apparatus of claim 7, further comprising:

the decompression control module is used for sending a decompression starting instruction to the decompression module at the decompression time node;

and the decompression module is used for decompressing the target data after receiving the decompression starting instruction.