WO2020137611A1

WO2020137611A1 - Data compression method

Info

Publication number: WO2020137611A1
Application number: PCT/JP2019/048912
Authority: WO
Inventors: 哲也福田; 佐藤　賢一; 圭大村
Original assignee: 日本電信電話株式会社
Priority date: 2018-12-28
Filing date: 2019-12-13
Publication date: 2020-07-02
Also published as: JP2020108045A

Abstract

The present invention achieves a balance between saving memory and saving disk capacity. A data compression unit 12 fetches uncompressed data stored in time series from a memory 3, compresses the data, and generates a first block comprising the compressed compression data, and a data storage unit 13 re-stores the first block in the memory 3. A block joining unit 14 joins a prescribed number of first blocks, and a joined block compression unit 15 re-compresses the joined prescribed number of first blocks and generates a second block comprising the re-compressed recompression data. A search information generation unit 16 generates a representative key from a plurality of first block unique keys included in the second block, applies the generated representative key to the second block, and generates and retains data search information associating a second block representative key and the plurality of first block unique keys included in the second block. A data write unit 17 writes the second block to which the representative key was applied into a file on a disk 5.

Description

Data compression method

The present invention relates to a database device technology for collecting and accumulating a large amount of information data generated by an IoT device under resource-saving conditions. In particular, it relates to the processing on the memory until the database process receives the information data via the communication network, stores it in the memory, and writes it to the disk.

In IoT (Internet of Things), the place where an IoT device such as a sensor or other device generates information data is called the “edge,” in contrast to the cloud. At the edge, the scalability of computing and storage resources is often strongly constrained compared with the cloud. Nevertheless, a large amount of information data generated by IoT devices must be processed even at the edge. Database device techniques that collect a large amount of information data generated by IoT devices and write it to a disk under resource-saving conditions are therefore under study.

The conventional technology has two common features. The first is to hold data in memory in a write-once data structure for a certain period in order to increase the speed of writing to the disk. A write-once data structure is one in which data is appended sequentially in time series. The second is to improve the compression rate, since a large amount of information data is written to the disk, by collecting data over a long period for each data generator of the information data and compressing it all at once.

For example, Non-Patent Document 1 discloses a method in which information data is stored in memory in a write-once data structure and, after a predetermined amount has accumulated in memory, compression is performed for each data generation source to generate a block of data, and the generated block is written to the disk.

In Non-Patent Document 2, as information data is collected, it is compressed sequentially in order of arrival in units of a predetermined number of data points and stored in memory in a write-once data structure called a chunk, which is treated as one unit. A chunk is generated for each data generation source and has a fixed size. The data is then written to the disk in chunk units in accordance with memory usage. The chunk is equivalent to the block in Non-Patent Document 1.

As described above, the edge must collect and accumulate a large amount of information data generated by IoT devices while the scalability of its computing and storage resources is strongly constrained. A method that can collect and store information data at the edge under resource-saving conditions is therefore required. However, the conventional technology cannot achieve both memory saving and disk capacity saving.

According to the conventional technology, in order to achieve a high data compression rate, long-term data is held in a single unit on the memory, so a large amount of memory capacity is required. On the other hand, under the memory-saving condition, long-term data cannot be stored in the memory, and the data compression rate deteriorates, so that the disk usage increases.

For example, in Non-Patent Document 1, achieving a high data compression rate requires a large memory area for collecting information data, especially when the number of data generators is large. Moreover, since the information data is held in an uncompressed state, memory usage per data point is inefficient and a large amount of memory is consumed. If the memory area is reduced, the data compression rate deteriorates.

In Non-Patent Document 2, since the block (chunk) size is fixed, memory becomes tight as the number of data generators increases unless the block size is set small. On the other hand, when the block size is reduced, data for a long period cannot be collected in a block, and data with a low compression rate is written to the disk in block units, which increases disk usage.

The present invention has been made in view of the above circumstances, and an object thereof is to achieve both memory saving and disk capacity saving.

The data compression method of the present invention is a data compression method performed by a data compression device, comprising: a first step of compressing uncompressed data stored in a memory and re-storing the compressed data in the memory; a second step of combining and recompressing a predetermined number of the compressed data, adding key information to the recompressed data, and writing it to a disk; and a third step of generating data search information in which the key information of the recompressed data is associated with key information indicating the data generation source of the uncompressed data related to the compressed data included in the recompressed data.

In the above data compression method, in the first step, the uncompressed data is compressed for each data generator of the uncompressed data, and in the second step, when the number of compressed data related to a plurality of data generators reaches the predetermined number, regardless of whether the data generators are the same, the predetermined number of compressed data are combined and compressed.

In the above data compression method, in the first step, a part of the uncompressed data is compressed, and in the second step, the entire compressed data is recompressed.

According to the present invention, both memory saving and disk capacity saving can be realized.

FIG. 1 is a diagram showing the functional block configuration of the data compression device.
FIG. 2 is a diagram showing the processing flow of the data compression method.
FIG. 3 is a diagram showing the storage of uncompressed data, the compression processing, and the re-storing of compressed data.
FIG. 4 is a diagram showing the recompression of compressed data.
FIG. 5 is a diagram showing the generation of a representative key and time information.
FIG. 6 is a diagram showing an example of the data layout in a file.

In the present invention, the above resource-saving problem is approached from the viewpoint of correcting the deterioration of the compression rate that occurs when memory is saved, and a two-stage compression method is proposed. Specifically, to reduce memory usage, a first compression process compresses the data in memory; to reduce disk usage, a second compression process combines a plurality of compressed data and recompresses them. In addition, so that the compressed data obtained in the first compression process can be searched, key information is added to the recompressed data obtained in the second compression process, and data search information is generated that associates the key information of the recompressed data with the key information indicating the data generation source of the uncompressed data related to the compressed data included in the recompressed data.

As described above, the present invention uses a two-stage compression method that combines two compression processes, one addressing memory and one addressing the disk, with the generation of data search information, so both memory usage and disk usage can be reduced without impairing the searchability of the data. Hereinafter, an embodiment for carrying out the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing a functional block configuration of a data compression device 1 according to the present embodiment. The data compression apparatus 1 is an apparatus that operates in the database apparatus 100 including the memory 3 and the disk 5 and receives and compresses information data transmitted by the plurality of IoT devices 300 to the database apparatus 100 via the communication network.

The IoT device 300 is, for example, a sensor or a device, and outputs a sensor value that it detects as information data. The database device 100 may be a device with small memory and disk capacity, such as an IoT gateway, or a general personal computer with small memory and disk capacity.

First, the functions of the data compression device 1 will be described. As shown in FIG. 1, the data compression device 1 mainly includes a data receiving unit 11, a data compression unit 12, a data storage unit 13, a block combination unit 14, a combined block compression unit 15, a search information generation unit 16, and a data writing unit 17.

The data receiving unit 11 has a function of receiving the information data generated by the IoT device 300 and storing the received information data in the memory 3 as it is in a write-once data structure in the order of reception. For example, the data receiving unit 11 adds a plurality of information data output from a plurality of IoT devices 300 or a data string constituting one information data to the memory 3 in a non-compressed state in the order of reception in time series.
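The write-once storage performed by the data receiving unit 11 can be pictured with the following minimal Python sketch. It assumes that each piece of information data is a (unique key, timestamp, value) record; the names `Record` and `AppendOnlyBuffer` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record shape: (unique key of the data generator, timestamp, value).
Record = Tuple[str, int, float]


@dataclass
class AppendOnlyBuffer:
    """Write-once structure: records are only ever appended, in arrival order."""
    records: List[Record] = field(default_factory=list)

    def append(self, key: str, timestamp: int, value: float) -> None:
        # Stored uncompressed; arrival order is the time-series order.
        self.records.append((key, timestamp, value))

    def for_key(self, key: str) -> List[Record]:
        # Records belonging to one data generator, still in arrival order.
        return [r for r in self.records if r[0] == key]


buffer = AppendOnlyBuffer()
buffer.append("key1", 100, 20.5)
buffer.append("key2", 101, 7.0)
buffer.append("key1", 102, 20.6)
```

Records from different IoT devices are interleaved in the buffer, which is why the later compression step has to group them by data generator.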

The data compression unit 12 has a function of extracting the information data stored in time series (uncompressed data) from the memory 3, compressing the extracted uncompressed data for each data generator of the uncompressed data (for example, for each IoT device or for each application in an IoT device), and generating a block (hereinafter referred to as a first block) consisting of the compressed data. The first block may be generated by compressing the information data after an arbitrary or predetermined amount (which can be set arbitrarily) has been stored in the memory 3, or by sequentially compressing the information data in order of arrival and storing it in fixed-size blocks.

The data storage unit 13 has a function of re-storing the first block generated by the data compression unit 12 in the memory 3.

The block combination unit 14 has a function of extracting a predetermined number of first blocks from the memory 3 and combining them. For example, the block combination unit 14 combines an arbitrary or predetermined number (which can be set arbitrarily) of first blocks once the number of first blocks re-stored in the memory 3 reaches that number. The first blocks to be combined may be selected for each data generation source, from a plurality of data generation sources, or from any combination of data generation sources. Selecting from a plurality of data generation sources regardless of the type of data generation source is highly efficient.

The combined block compression unit 15 has a function of recompressing the predetermined number of first blocks combined by the block combination unit 14 and generating a block (hereinafter, a second block) consisting of the recompressed data. The compression method used by the combined block compression unit 15 may be the same as that used by the data compression unit 12, but to increase the data compression rate it is desirable to use a different compression method, or a combination of a plurality of compression methods, chosen in consideration of the characteristics and properties of the data. For example, focusing on the data structure of the uncompressed data, the data compression unit 12 may compress a part of the data sequence forming the uncompressed data, and the combined block compression unit 15 may compress the entire data sequence forming the compressed data.

Here, the second block will be explained. The second block stores a plurality of first blocks that have different unique keys indicating their data generation sources and different time ranges. Conventionally, a first block could be accessed directly by using its unique key. In the present invention, however, a first block inside a second block cannot be accessed directly, so information and key information that allow the unique key to be mapped to the second block need to be defined.

Therefore, first, a representative key is generated from the unique keys of the plurality of first blocks from which the second block is generated, time information on the minimum time and maximum time of the plurality of first blocks as a whole is calculated from the minimum time and maximum time of each first block, and the generated representative key, which is unique to the second block, and the calculated time information are added to the second block. In addition, the unique key of each first block included in the second block and the time information on the minimum time and maximum time of each first block are added to the second block. Further, data search information (a search map) in which the representative key of the second block is associated with the unique key of each first block included in the second block is generated and held. After that, the second block is written to the file in the disk 5.

Then, when a desired first block is searched for in the file in the disk 5, the representative key corresponding to the unique key in the search query is looked up in the search map, the second block to which that representative key has been added is identified, and the first block corresponding to the unique key is acquired from the identified second block and decompressed. As a result, the information data written to the disk 5 can be obtained.

Therefore, the search information generation unit 16 generates a representative key from the unique keys of the plurality of first blocks included in the second block generated by the combined block compression unit 15, and attaches the generated representative key to the second block. In addition, it has a function of generating and holding a search map in which the representative key of the second block and the unique keys of the plurality of first blocks included in the second block are associated with each other. Further, the data writing unit 17 has a function of writing the second block to which the representative key is added to the file in the disk 5.

Up to this point, the functions of the data compression device 1 have been described. The functional block configuration of the data compression device 1 described above is an example. For example, the data receiving unit 11, the data compression unit 12, and the data storage unit 13 may be realized by one memory reduction compression unit, and the block combination unit 14, the combined block compression unit 15, the search information generation unit 16, and the data writing unit 17 may be realized by one disk reduction compression unit.

In this case, the memory reduction compression unit has a function of compressing the uncompressed data stored in the memory 3 and re-storing the compressed data in the memory 3. The disk reduction compression unit has a function of combining and recompressing a predetermined number of compressed data, adding a representative key to the recompressed data, and writing it to the disk, and a function of generating a search map in which the representative key of the recompressed data is associated with the unique key indicating the data generation source of the uncompressed data related to the compressed data included in the recompressed data.

The data compression device 1 can be realized by a computer including a CPU, a memory, an input/output interface, a communication interface, and the like. It is also possible to create a data compression program for causing a computer to function as the data compression device 1 and a storage medium for the data compression program.

Next, a data compression method performed by the data compression device 1 will be described. FIG. 2 is a diagram showing a processing flow of the data compression method.

Step S1;
First, the data receiving unit 11 receives the information data that the database device 100 has received from the IoT device 300, and stores the received information data in the memory 3 in an uncompressed state, in time series in the order of reception.

Step S2;
Next, the data compression unit 12 measures the size of the information data (uncompressed data) stored in the memory 3 for each data generation source (for each unique key) of the uncompressed data, and determines whether it has reached a predetermined amount. If the amount of uncompressed data has not reached the predetermined amount, the process returns to step S1; if it has reached the predetermined amount, the process proceeds to step S3.
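The loop between steps S1 and S2 might look like the following Python sketch. It assumes that "size" is measured as a count of data points per unique key; the threshold `PREDETERMINED_AMOUNT` and the function `receive` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

PREDETERMINED_AMOUNT = 3  # hypothetical threshold, in data points per unique key

# Uncompressed data points held in memory, grouped by unique key.
pending: Dict[str, List[Tuple[int, float]]] = defaultdict(list)


def receive(key: str, timestamp: int, value: float) -> Optional[List[Tuple[int, float]]]:
    """Step S1 then S2: store the point, then return a full batch if one is ready."""
    pending[key].append((timestamp, value))
    if len(pending[key]) < PREDETERMINED_AMOUNT:
        return None  # not enough data for this key yet: back to step S1
    # Enough data for this key: release the batch for step S3 and free the buffer.
    return pending.pop(key)
```

A byte-size threshold would work the same way; only the comparison in `receive` would change.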

Step S3;
Next, the data compression unit 12 compresses the uncompressed data that has reached the predetermined amount and generates a first block consisting of the compressed data. The compression in step S3 is the first compression process; for example, compression with high locality is applied to the data sequence of the uncompressed data (for example, compression that uses only neighboring values in the data sequence).
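The source does not name a specific algorithm for the first compression process. As one illustration of compression that depends only on neighboring values, the Python sketch below uses delta encoding followed by a variable-length integer (zigzag varint) encoding; this choice, and all function names, are assumptions rather than the method of the invention.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def to_varint_bytes(deltas: List[int]) -> bytes:
    """Pack integers as zigzag varints so that small deltas take one byte each."""
    buf = bytearray()
    for d in deltas:
        z = d * 2 if d >= 0 else -d * 2 - 1  # zigzag: map signed to unsigned
        while z >= 0x80:
            buf.append((z & 0x7F) | 0x80)  # low 7 bits plus continuation flag
            z >>= 7
        buf.append(z)
    return bytes(buf)


timestamps = [1_600_000_000, 1_600_000_010, 1_600_000_020, 1_600_000_030]
first_block = to_varint_bytes(delta_encode(timestamps))
```

Because each output byte depends only on a value and its predecessor, this kind of compression can run on small batches, but it cannot exploit repetition across batches; that is left to the second compression process.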

Step S4;
Next, the data compression unit 12 generates a unique key indicating the data generation source of the uncompressed data related to the first block, and calculates the minimum time and the maximum time of the uncompressed data related to the first block. For the minimum time and the maximum time of the uncompressed data, the reception start time and the reception end time of the uncompressed data may be used as they are, or the minimum time may be set to zero and the maximum time to the elapsed time from the reception start time.
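The metadata attached in step S4 can be sketched in Python as follows. The sketch assumes that the timestamps of the data points stand in for the reception start and end times; the class `FirstBlock` and the function `make_first_block` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FirstBlock:
    unique_key: str   # identifies the data generation source
    min_time: int     # earliest time covered by the block
    max_time: int     # latest time covered by the block
    payload: bytes    # compressed data produced in step S3


def make_first_block(unique_key: str, points: List[Tuple[int, float]],
                     payload: bytes, relative: bool = False) -> FirstBlock:
    """Attach the unique key and time range to a compressed payload (step S4)."""
    times = [t for t, _ in points]
    lo, hi = min(times), max(times)
    if relative:
        # Alternative from the text: minimum is zero, maximum is the elapsed time.
        return FirstBlock(unique_key, 0, hi - lo, payload)
    return FirstBlock(unique_key, lo, hi, payload)
```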

Step S5;
Next, the data storage unit 13 re-stores the first block generated by the data compression unit 12 in the memory 3. FIG. 3 shows the storage of uncompressed data, the compression processing, and the re-storing of compressed data performed in steps S1 to S5. In FIG. 3, the uncompressed data for the unique keys key1 to key3 is stored sequentially in time series; because the amount of uncompressed data for the unique key key1 has reached the predetermined amount, that uncompressed data is compressed, and the resulting first block of compressed data is re-stored in the memory 3.

Step S6;
Next, the data compression device 1 determines whether or not the number of first blocks has reached a predetermined number. If the number of the first blocks has not reached the predetermined number, the process returns to step S1. By returning to step S1 and repeating steps S1 to S5, a plurality of first blocks compressed to approximately the same size are generated for each unique key. Then, when the number of the first blocks reaches the predetermined number, the process proceeds to step S7.

Note that the number of first blocks reaching the predetermined number is the recompression condition for starting the second compression process. The recompression condition can be set per unique key or across unique keys. Examples are: when the total number of first blocks generated for a single unique key reaches a predetermined number; when the total number of first blocks generated over all unique keys reaches a predetermined number; and when the total number of first blocks held in an arbitrary combination of unique keys reaches a predetermined number. The data compression device 1 executes step S6 using the recompression condition designated by the user.
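The three kinds of recompression condition can be expressed as simple predicates. The Python sketch below assumes that the first blocks held in memory are represented only by their unique keys; the threshold value and the function names are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List

PREDETERMINED_NUMBER = 6  # hypothetical threshold


def per_key_ready(block_keys: List[str], key: str) -> bool:
    """Condition 1: the first blocks of one unique key reach the threshold."""
    return Counter(block_keys)[key] >= PREDETERMINED_NUMBER


def all_keys_ready(block_keys: List[str]) -> bool:
    """Condition 2: the first blocks over all unique keys reach the threshold."""
    return len(block_keys) >= PREDETERMINED_NUMBER


def group_ready(block_keys: List[str], group: FrozenSet[str]) -> bool:
    """Condition 3: the first blocks of a chosen combination of keys reach the threshold."""
    return sum(1 for k in block_keys if k in group) >= PREDETERMINED_NUMBER


# Unique keys of the first blocks currently held in memory, in creation order.
held = ["key1", "key2", "key1", "key3", "key2", "key1"]
```

With six blocks held, only the condition across all unique keys is satisfied here; key1 alone has three blocks, and the combination of key1 and key2 has five.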

Step S7;
When the number of first blocks reaches the predetermined number (when the recompression condition is satisfied), the block combination unit 14 extracts the predetermined number of first blocks from the memory 3 and combines the extracted first blocks.

Step S8;
Next, the combined block compression unit 15 recompresses the predetermined number of first blocks combined by the block combination unit 14, generates a second block consisting of the recompressed data, and puts it into a state in which it can be written to the disk 5. The compression in step S8 is the second compression process; for example, a compression program or compression algorithm such as GZIP (GNU Zip) or ZSTD (Zstandard) is used to compress, as a whole, the data strings included in all the combined compressed data.
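Steps S7 and S8 can be sketched in Python with the standard `gzip` module. The 4-byte length prefix used to join the first blocks is an assumption made so that the blocks can be separated again after decompression; the source does not specify how the joined blocks are delimited, and the function names are hypothetical.

```python
import gzip
import struct
from typing import List


def join_blocks(first_blocks: List[bytes]) -> bytes:
    """Step S7: concatenate first blocks, each prefixed with its 4-byte length."""
    return b"".join(struct.pack(">I", len(b)) + b for b in first_blocks)


def recompress(joined: bytes) -> bytes:
    """Step S8: compress the whole joined byte string in one pass."""
    return gzip.compress(joined)


def split_blocks(second_block: bytes) -> List[bytes]:
    """Inverse of S7 and S8, used when a second block is read back."""
    joined = gzip.decompress(second_block)
    blocks, pos = [], 0
    while pos < len(joined):
        (size,) = struct.unpack_from(">I", joined, pos)
        pos += 4
        blocks.append(joined[pos:pos + size])
        pos += size
    return blocks


# Placeholder byte strings standing in for first blocks of different data generators.
first_blocks = [b"temp:20,20,21,21", b"hum:55,55,56", b"temp:21,22,22,22"]
second_block = recompress(join_blocks(first_blocks))
```

Because the whole joined string is compressed at once, repetition between first blocks, including blocks from different data generators, is visible to the compressor.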

FIG. 4 shows how the compressed data is combined and recompressed in steps S7 to S8. In FIG. 4, of the three recompression conditions described above, “when the total number of first blocks generated over all unique keys reaches a predetermined number” is used; since the number of compressed data for the unique keys key1 to key3 has reached the predetermined number of six, the six compressed data are joined in a row and recompressed, and the second block consisting of the recompressed data is written to the file in the disk 5.

Step S9;
Next, before the second block is written to the disk 5, the search information generation unit 16 creates access information that makes it possible to access the first blocks in the file in the disk 5 that contains the written second blocks.

Specifically, the search information generation unit 16 first generates a single representative key from the set of unique keys included in the second block. For example, as shown in FIG. 5, the representative key K1 is generated from the unique keys key1 to key3. The representative key is generated by, for example, sorting the plurality of unique keys associated with the same representative key and deriving the representative key from the unique key that comes first in dictionary order (key1 among key1 to key3), that is, key1→K1.
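The representative key generation can be sketched in Python as follows. The source gives only the example key1→K1 and does not state the derivation rule, so `derive_name` below is a hypothetical rule chosen solely to reproduce that example; both function names are assumptions.

```python
from typing import Callable, Iterable


def derive_name(first_key: str) -> str:
    # Hypothetical rule that reproduces "key1 -> K1"; the source does not
    # specify how the representative key is formed from the first unique key.
    suffix = first_key[3:] if first_key.startswith("key") else first_key
    return "K" + suffix


def representative_key(unique_keys: Iterable[str],
                       derive: Callable[[str], str] = derive_name) -> str:
    """Sort the unique keys and derive the representative key from the first one."""
    ordered = sorted(set(unique_keys))
    if not ordered:
        raise ValueError("a second block must contain at least one first block")
    return derive(ordered[0])
```

Note that Python's `sorted` orders strings by code point, so "key10" comes before "key2"; a deployment with numbered keys would need to decide whether that ordering is acceptable.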

Next, the search information generation unit 16 refers to the minimum time and maximum time given to every first block in the second block, and sets the smallest of the minimum times and the largest of the maximum times as the minimum time and maximum time of the second block. For example, as shown in FIG. 5, the minimum time (MinTime1) and the maximum time (MaxTime1) are calculated using the MIN function and the MAX function, respectively.
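The time range calculation reduces to one MIN and one MAX over the first blocks. A minimal Python sketch, assuming each first block's range is given as a (minimum time, maximum time) pair and using the hypothetical function name `second_block_time_range`:

```python
from typing import List, Tuple


def second_block_time_range(ranges: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (MIN of all minimum times, MAX of all maximum times)."""
    if not ranges:
        raise ValueError("a second block must contain at least one first block")
    return min(lo for lo, _ in ranges), max(hi for _, hi in ranges)
```

For first blocks covering (100, 160), (90, 120), and (150, 200), the second block covers (90, 200), even though no single first block spans that whole range.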

Then, the search information generation unit 16 adds the representative key and the time range (minimum time, maximum time) as the index information 1 to the second block. The search information generation unit 16 also assigns the unique key and the time range (minimum time, maximum time) of each first block included in the second block to the second block as the index information 2.

Further, the search information generation unit 16 generates and holds a search map in which the representative key of the second block and the unique key of each first block included in the second block are associated with each other. For example, as shown in FIG. 5, a search map such as “m(key1)=m(key2)=m(key3)=K1” is generated.
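The search map can be held as a dictionary from unique key to representative key. The Python sketch below extends the mapping m(key)=K to a list of representative keys per unique key, on the assumption that the same unique key may later appear in more than one second block; that extension, and the names `search_map` and `register_second_block`, are not taken from the source.

```python
from typing import Dict, Iterable, List

# unique key -> representative keys of the second blocks that contain it
search_map: Dict[str, List[str]] = {}


def register_second_block(representative_key: str, unique_keys: Iterable[str]) -> None:
    """Associate every unique key in a second block with its representative key."""
    for key in sorted(set(unique_keys)):
        reps = search_map.setdefault(key, [])
        if representative_key not in reps:
            reps.append(representative_key)


# Reproduces m(key1) = m(key2) = m(key3) = K1 from FIG. 5.
register_second_block("K1", ["key1", "key2", "key3"])
```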

Step S10;
Finally, the data writing unit 17 writes the second block, to which the access information generated by the search information generation unit 16 has been added, to the file in the disk 5. FIG. 6 shows an image of the data layout in the file. FIG. 6 shows an example layout containing one second block (recompressed data) obtained by combining and recompressing six first blocks (compressed data), together with index information 2, which indicates the unique key, time range (minimum time, maximum time), and storage position of each first block, and index information 1, which indicates the representative key, time range (minimum time, maximum time), and storage position of the second block.
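One possible serialization of a second block together with its two kinds of index information is sketched below in Python. The byte layout (a length-prefixed JSON header followed by the recompressed payload) and the field names are assumptions made for illustration; FIG. 6 is not reproduced here, the actual layout may differ, and the storage position of the second block itself is omitted for brevity.

```python
import json
import struct
from typing import Dict, List, Tuple


def encode_record(index1: Dict, index2: List[Dict], payload: bytes) -> bytes:
    """Serialize one second block as: header length | JSON header | payload."""
    header = json.dumps({"index1": index1, "index2": index2}).encode("utf-8")
    return struct.pack(">I", len(header)) + header + payload


def decode_record(record: bytes) -> Tuple[Dict, List[Dict], bytes]:
    """Read back index information 1, index information 2, and the payload."""
    (header_len,) = struct.unpack_from(">I", record, 0)
    header = json.loads(record[4:4 + header_len].decode("utf-8"))
    return header["index1"], header["index2"], record[4 + header_len:]


# Index information 1: representative key and time range of the second block.
index1 = {"rep_key": "K1", "min_time": 90, "max_time": 200}
# Index information 2: unique key, time range, and position of each first block.
index2 = [
    {"key": "key1", "min_time": 100, "max_time": 160, "offset": 0, "size": 16},
    {"key": "key2", "min_time": 90, "max_time": 120, "offset": 16, "size": 12},
]
payload = b"<recompressed data>"  # placeholder for the output of step S8
record = encode_record(index1, index2, payload)
```

Keeping both indexes outside the compressed payload lets a reader decide whether a second block is relevant before decompressing it.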

In step S6, when the recompression condition “when the total number of first blocks held in an arbitrary combination of unique keys reaches a predetermined number” is used, a special attribute value may be added to the write data. Alternatively, the read pattern may be recorded and unique keys that are read at the same time may be grouped together.

Next, a data search method for searching the disk 5 for data will be described.

When the database device 100 receives a search query including a desired unique key and a time range, it converts the unique key in the received search query into a representative key using the search map, and searches the file in the disk 5 for the second block corresponding to the converted representative key and the time range in the search query. After finding the second block, the database device 100 refers to the index information 2 in the second block to obtain the information on the first block corresponding to the unique key included in the search query. The database device 100 then acquires, from the file in the disk 5, the first block that corresponds to the unique key and to the time range in the search query, decompresses it according to the first compression method, and obtains the information data. If the second compression is performed for each unique key, the procedure involving the representative key can be omitted.
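The search procedure can be sketched end to end in Python. The sketch assumes an in-memory list standing in for the file in the disk 5, a single-valued search map, `gzip` as the second compression method, and byte offsets into the decompressed data as the storage position; all type and function names are hypothetical.

```python
import gzip
from typing import Dict, List, Tuple

FirstBlock = Tuple[str, int, int, bytes]     # (unique key, min time, max time, data)
IndexEntry = Tuple[str, int, int, int, int]  # (unique key, min time, max time, offset, size)
SecondBlock = Tuple[str, Tuple[int, int], List[IndexEntry], bytes]


def build_second_block(rep_key: str, blocks: List[FirstBlock]) -> SecondBlock:
    """Join first blocks, record index information 2, and recompress the whole."""
    index2: List[IndexEntry] = []
    joined, offset = b"", 0
    for key, lo, hi, data in blocks:
        index2.append((key, lo, hi, offset, len(data)))
        joined += data
        offset += len(data)
    time_range = (min(b[1] for b in blocks), max(b[2] for b in blocks))
    return rep_key, time_range, index2, gzip.compress(joined)


def search(file: List[SecondBlock], search_map: Dict[str, str],
           key: str, t_from: int, t_to: int) -> List[bytes]:
    """Return the first blocks of `key` whose time range overlaps [t_from, t_to]."""
    results: List[bytes] = []
    rep_key = search_map.get(key)  # unique key -> representative key
    for rep, (lo, hi), index2, payload in file:
        if rep != rep_key or hi < t_from or lo > t_to:
            continue  # index information 1 rules this second block out
        joined = gzip.decompress(payload)
        for k, b_lo, b_hi, offset, size in index2:
            if k == key and b_hi >= t_from and b_lo <= t_to:
                results.append(joined[offset:offset + size])
    return results
```

The byte strings returned by `search` are still first blocks, so a caller would apply the decompression that matches the first compression method to recover the information data.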

As described above, according to the present embodiment, the data compression device 1 performs the first compression process of compressing the uncompressed data in the memory, so the data can be held in the memory in compressed form, and the amount of memory required to hold the same amount of data can be reduced.

Also, since the data compression device 1 performs the second compression process in which a plurality of compressed data are combined and recompressed, the compression rate on the disk can be improved. In particular, when compressing a plurality of compressed data across unique keys, the compression rate on the disk can be further improved. That is, it is possible to connect the first blocks to each other, and it is possible to collectively compress blocks having the same unique key or blocks having a plurality of unique keys. Therefore, the compression rate can be improved in a memory-saving situation where only a small block can be held in the memory.
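The effect of combining small blocks before recompression can be checked with a small experiment. The Python sketch below uses synthetic data and a stand-in first stage (a key label plus one-byte deltas), both of which are assumptions; it only compares the total size when each small block is compressed with `gzip` on its own against the size when the blocks are joined first.

```python
import gzip
from typing import List


def first_block(key: str, values: List[int]) -> bytes:
    """Stand-in first stage: key label plus one-byte deltas between neighbors."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return key.encode("ascii") + b"|" + bytes(d & 0xFF for d in deltas)


# Twelve small synthetic blocks from three data generators.
blocks = [
    first_block(f"sensor-{i % 3}", [20 + (i + j) % 4 for j in range(32)])
    for i in range(12)
]

separately = sum(len(gzip.compress(b)) for b in blocks)  # each block on its own
together = len(gzip.compress(b"".join(blocks)))          # joined, then recompressed
print(f"separately: {separately} bytes, together: {together} bytes")
```

Each separately compressed block pays the fixed gzip header and trailer and cannot reuse patterns seen in other blocks, which is why the joined form comes out smaller on this data. Real sensor data may behave differently.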

Therefore, it is possible to collect and store data while achieving both memory saving and disk capacity saving.

DESCRIPTION OF SYMBOLS
1... Data compression device
11... Data receiving unit
12... Data compression unit
13... Data storage unit
14... Block combination unit
15... Combined block compression unit
16... Search information generation unit
17... Data writing unit
3... Memory
5... Disk (File)
100... Database device
300... IoT device

Claims

1. A data compression method performed by a data compression device, the method comprising:
a first step of compressing uncompressed data stored in a memory and re-storing the compressed data in the memory;
a second step of combining and recompressing a predetermined number of the compressed data, adding key information to the recompressed data, and writing the recompressed data to a disk; and
a third step of generating data search information in which the key information of the recompressed data is associated with key information indicating a data generation source of the uncompressed data related to the compressed data included in the recompressed data.
2. The data compression method according to claim 1, wherein
in the first step, the uncompressed data is compressed for each data generator of the uncompressed data, and
in the second step, when the number of compressed data related to a plurality of data generation sources reaches the predetermined number, regardless of whether the data generation sources are the same, the predetermined number of compressed data are combined and compressed.
3. The data compression method according to claim 1, wherein
in the first step, compression is performed on a part of the uncompressed data, and
in the second step, the entire compressed data is recompressed.