CN107682016B - Data compression method, data decompression method and related system - Google Patents
Data compression method, data decompression method and related system
- Publication number
- CN107682016B CN201710884914.2A
- Authority
- CN
- China
- Prior art keywords
- data
- data blocks
- blocks
- similar
- generate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention discloses a data compression method, a data decompression method and a related system, which are used for removing redundant data by migrating and recombining similar data blocks after original data is divided into a plurality of data blocks, thereby improving the compression rate of the data. The method provided by the embodiment of the invention comprises the following steps: dividing original data into a plurality of data blocks; detecting similarity of a plurality of data blocks; sequentially migrating and recombining the similar data blocks to generate recombined data; and compressing the recombined data to generate compressed data. The embodiment also provides a data decompression method and a related system, which are used for improving the compression rate of data.
Description
Technical Field
The invention relates to the technical field of computer data processing, in particular to a data compression method, a data decompression method and a related system.
Background
Data compression is a technical method for reducing the data volume to reduce the storage space and improve the transmission, storage and processing efficiency of the data on the premise of not losing useful information, or for reorganizing the data according to a certain algorithm and reducing the redundancy and storage space of the data.
Current data compression techniques are broadly divided into lossy compression and lossless compression, and most existing lossless techniques derive from the dictionary-based coding schemes LZ77 and LZ78. Dictionary coding mainly relies on a sliding-window buffer: the current character sequence is matched against the character sequences buffered in the sliding window, and a repeated sequence is represented by a relatively short code, which eliminates redundancy at the string level.
On the one hand, the larger the sliding window, the easier it is to find redundant data and the more redundancy can be eliminated. However, the time spent searching for matching redundant strings grows rapidly as the window grows, so most compression algorithms limit the amount of data they search; for example, bzip2 processes at most 900 KB at a time. On the other hand, if the sliding window is too small, redundant data that falls in different windows cannot be eliminated because the copies are too far apart, and a large amount of redundant data remains in the storage system. Meanwhile, string matching on non-redundant data is very time-consuming and reduces the data compression speed of the storage system.
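The effect of a limited window can be observed with a small experiment. The sketch below is an illustration only and is not part of the patent: it uses Python's zlib (DEFLATE, whose window is 32 KB) as a stand-in for a sliding-window compressor, and the block sizes and seeds are arbitrary assumptions.

```python
import random
import zlib


def random_bytes(n: int, seed: int) -> bytes:
    """Deterministic, effectively incompressible test data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


block = random_bytes(16 * 1024, seed=1)    # 16 KB block that occurs twice
filler = random_bytes(64 * 1024, seed=2)   # 64 KB of unrelated data

far = block + filler + block    # copies 80 KB apart: outside the 32 KB window
near = block + block + filler   # copies adjacent: inside the window

far_size = len(zlib.compress(far, 9))
near_size = len(zlib.compress(near, 9))
print(far_size, near_size)
```

Both inputs contain exactly the same bytes, yet only the layout with adjacent copies lets the compressor replace the second copy with back-references.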
Disclosure of Invention
The embodiment of the invention provides a data compression method, a data decompression method and a related system, which are used for removing redundant data by migrating and recombining similar data blocks after original data is divided into a plurality of data blocks, thereby solving the problem that data redundancy at a long distance cannot be removed due to the limitation of the size of a sliding window in the traditional compression technology.
One aspect of the present invention provides a method for data compression, including:
dividing original data into a plurality of data blocks;
detecting similarity of a plurality of data blocks;
sequentially migrating and recombining the similar data blocks to generate recombined data;
and compressing the recombined data to generate compressed data.
Optionally, after dividing the original data into a plurality of data blocks and before detecting the similarity of the plurality of data blocks, the method further includes:
and recording the sequence, the offset and the block length of the plurality of data blocks to generate an original file spectrum.
Optionally, after sequentially migrating and reorganizing the similar data blocks to generate reorganized data, and before compressing the reorganized data, the method further includes:
updating the offsets of the data blocks according to the reorganized data to obtain a new original file spectrum;
and compressing the new original file spectrum to generate a compressed file spectrum.
Optionally, migrating and recombining the similar data blocks in sequence to generate recombined data, including:
sequentially migrating and recombining the similar data blocks to generate a plurality of similar linked lists;
and reading the corresponding data block content from the original data according to the plurality of similar linked lists to generate recombined data.
Optionally, detecting the similarity of the plurality of data blocks includes:
and detecting the similarity of the data blocks by a super characteristic value method, a Simhash method or a Minhash method.
Another aspect of the present invention provides a data decompression method, including:
decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
respectively reading a plurality of data blocks from the recombined data according to the sequence, the offset and the block length of the plurality of data blocks recorded by the new original file spectrum;
and sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
The invention also provides a system for data compression, comprising:
a block unit for dividing original data into a plurality of data blocks;
a detection unit configured to detect similarities of the plurality of data blocks;
the recombination unit is used for sequentially migrating and recombining the similar data blocks to generate recombined data;
and the compression unit is used for compressing the recombined data to generate compressed data.
The invention also provides a system for data decompression, comprising:
the decompression unit is used for decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
the reading unit is used for respectively reading a plurality of data blocks from the recombined data according to the offset and the block length of the plurality of data blocks recorded by the new original file spectrum;
and the writing unit is used for sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
The invention also provides a computer arrangement comprising a processor for executing a computer program stored on a memory, the following steps being implemented:
dividing original data into a plurality of data blocks;
detecting similarity of a plurality of data blocks;
sequentially migrating and recombining the similar data blocks to generate recombined data;
and compressing the recombined data to generate compressed data.
The invention also provides a computer arrangement comprising a processor for implementing the following steps when executing a computer program stored on a memory:
decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
respectively reading a plurality of data blocks from the recombined data according to the offset and the block length of a plurality of data blocks recorded by the new original file spectrum;
and sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
The present invention also provides a computer-readable storage medium having stored thereon a computer program for, when executed by a processor, performing the steps of:
dividing original data into a plurality of data blocks;
detecting similarity of a plurality of data blocks;
sequentially migrating and recombining the similar data blocks to generate recombined data;
and compressing the recombined data to generate compressed data.
The invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, is adapted to carry out the steps of:
decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
respectively reading a plurality of data blocks from the recombined data according to the offset and the block length of a plurality of data blocks recorded by the new original file spectrum;
and sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
According to the technical scheme, the embodiment of the invention has the following advantages:
according to the method, the original data are divided into a plurality of data blocks, the similarity of the data blocks is detected, the similar data blocks are migrated and recombined to generate recombined data, and then the recombined data is compressed to obtain compressed data.
Drawings
FIG. 1 is a process diagram of a data compression method;
FIG. 2 is a diagram of an embodiment of a data compression method according to an embodiment of the present invention;
FIG. 3 is a diagram of another embodiment of a data compression method according to an embodiment of the present invention;
FIG. 4 is a schematic structural organization of a file spectrum;
FIG. 5 is a process diagram of a data decompression method;
FIG. 6 is a diagram of an embodiment of a data decompression method according to an embodiment of the present invention;
FIG. 7 is a diagram of an embodiment of a data compression system in accordance with an embodiment of the present invention;
FIG. 8 is a diagram of another embodiment of a data compression system in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of an embodiment of a data decompression system according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data compression method, a data decompression method and a related system, which are used for removing redundant data by migrating and recombining similar data blocks after original data is divided into a plurality of data blocks, thereby solving the problem that data redundancy at a longer distance cannot be removed due to the limitation of the size of a sliding window in the traditional compression technology and simultaneously improving the compression rate of the data.
To facilitate the understanding of the document, the terms appearing herein are first explained as follows:
Data blocking: data blocking uses a blocking algorithm to divide a file into a plurality of data blocks. The choice of blocking algorithm affects not only the blocking speed but also, to a large extent, how well similar data blocks are detected. Existing data blocking algorithms follow two basic strategies: fixed-length blocking and content-based blocking. Fixed-length blocking marks cut boundaries by position; it is simple to implement and fast, but because of the boundary-shift problem (an insertion or deletion moves every subsequent boundary), its redundancy detection is not ideal. Content-based blocking determines block boundaries from the local content of the data stream, which effectively solves the boundary-shift problem and divides the data stream into variable-length data blocks. By comparison, content-based blocking adapts better to workloads whose content is modified frequently, finds more redundant data, and is widely used in storage systems based on data deduplication.
Similarity detection: similarity detection identifies data blocks with highly similar content in order to find and eliminate similar redundancy in the storage system. Storage systems typically determine similarity between files by comparing representative fingerprints of the files. Commonly used similarity detection methods include the super-feature-value method, Simhash and Minhash.
Data migration: data migration changes the order of part of the data in a file so that similar data is clustered together, thereby improving the compression of the file. Data migration also provides a metadata-based recovery mechanism, and the basic unit of migration is a data block. After the file is divided into a plurality of data blocks, similar data blocks are identified by a similarity detection method and then moved so that they are physically adjacent, which makes the file data easier to compress.
Data compression: data compression is a mainstream technique for eliminating redundant data. It removes redundant information mainly through coding: without losing the original information, the original content is transformed so that repeated byte sequences are represented by codes with fewer bytes, which eliminates part of the redundant data. The concept of "information entropy" was first proposed by Claude Elwood Shannon (1916-2001): any information contains redundancy, the amount of which is related to the probability, or uncertainty, of occurrence of each symbol (number, letter or word) in the information. Shannon's information entropy theory laid the theoretical foundation of data compression. With the continuous growth of electronic digital information, data compression has developed into lossless compression, lossy compression and other techniques. Most existing lossless techniques derive from the dictionary-based coding schemes LZ77 and LZ78. Dictionary coding mainly relies on a sliding-window buffer: the current character sequence is matched against the character sequences buffered in the sliding window, and a repeated sequence is represented by a relatively short code, which eliminates redundancy at the string level.
For ease of understanding, FIG. 1 shows a process diagram of a data compression method, and the data compression method of the present invention is described below with reference to FIG. 1. Referring to FIG. 2, an embodiment of the data compression method in an embodiment of the present invention includes:
201. Dividing original data into a plurality of data blocks;
It can be understood that data compression eliminates redundant data while ensuring that the original data is not lost, thereby reducing storage space and accelerating data transmission.
The invention is based on the idea of similar data cluster combination, thereby eliminating redundant data to the greatest extent possible. In order to realize the cluster combination of similar data, the original data needs to be partitioned, so that the similar comparison of partitioned data is realized.
Data blocking divides the original data into a plurality of data blocks using a blocking algorithm. The average block size is about 8 KB (it can be set to 4 KB or 16 KB as needed), and the blocking algorithm may be a content-based blocking algorithm or a fixed-length blocking algorithm.
The fixed-length blocks mark the cutting boundary according to the block positions, and the method is simple to implement and high in cutting speed. Due to the problem of boundary movement, the redundancy detection effect of the fixed-length blocks is general. And the content-based block determines the block boundary according to the local content of the data stream, which effectively solves the problem of boundary movement and divides the data stream into data blocks with indefinite length. In contrast, the content-based blocking algorithm can better adapt to the load of frequently modified content, and more redundant data can be found.
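The two blocking strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the blocking algorithm of the patent, which does not name one: the content-based variant uses a simple gear-style rolling hash, and the table GEAR, the function names and the size parameters are all hypothetical.

```python
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # assumed per-byte random table
MASK64 = (1 << 64) - 1


def fixed_chunks(data: bytes, size: int = 8192) -> list[bytes]:
    """Fixed-length blocking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data: bytes, avg_bits: int = 13,
                   min_size: int = 2048, max_size: int = 65536) -> list[bytes]:
    """Content-based blocking: cut where the low bits of a rolling hash are zero."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The low `avg_bits` bits of h depend only on the most recent bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

With these parameters the content-based blocks are on the order of 8 KB on average. If a single byte is prepended to the input, every fixed-length block changes, whereas the content-based boundaries typically realign after the first block, which is the boundary-shift behaviour described above.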
202. Detecting similarity of a plurality of data blocks;
After the original data has been divided, a plurality of data blocks is obtained, and the data compression system performs similarity detection on these data blocks. Various similarity detection algorithms can be used, for example the super-feature-value method, Simhash or Minhash.
Specifically, how to implement the similarity detection of multiple data blocks by using the above algorithm is described in detail in the following embodiments.
It should be noted that the similarity detection algorithm in the present embodiment includes, but is not limited to, the above algorithm, and is not limited specifically here.
203. Sequentially migrating and recombining the similar data blocks to generate recombined data;
After similarity detection has been performed on the plurality of data blocks, the similar data blocks are clustered and recombined to form a plurality of similar linked lists. The data compression system reads the corresponding data blocks from the original data in the order given by the similar linked lists and writes the read data blocks out in sequence, which generates the recombined data.
Specifically, how to generate the similar linked list after the similarity detection of the plurality of data blocks, and how to obtain the recombined data according to the similar linked list are described in detail in the following embodiments.
204. And compressing the recombined data to generate compressed data.
After the multiple data blocks form the recombined data, the data compression system further compresses the recombined data through a traditional compression method, so that the redundancy of similar data blocks is removed to the maximum extent, and the compression rate of the original data is increased.
According to the method, the original data are divided into a plurality of data blocks, the similarity of the data blocks is detected, the similar data blocks are migrated and recombined to generate recombined data, and then the recombined data is compressed to obtain compressed data.
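Steps 201 to 204 can be sketched end to end as follows. This is a simplified illustration, not the implementation of the patent: it assumes fixed-length 8 KB blocks, treats two blocks as similar only when their SHA-256 digests are identical, and uses zlib as the traditional compressor. All function names are hypothetical.

```python
import hashlib
import random
import zlib

BLOCK = 8 * 1024  # assumed fixed block size of 8 KB


def compress_with_regrouping(data: bytes) -> tuple[bytes, list[int], list[int]]:
    """Return (compressed data, block order in the recombined data, block lengths)."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    groups: dict[bytes, list[int]] = {}
    for index, block in enumerate(blocks):
        # Stand-in similarity test: an identical digest means "similar".
        groups.setdefault(hashlib.sha256(block).digest(), []).append(index)
    order = [index for group in groups.values() for index in group]
    recombined = b"".join(blocks[index] for index in order)
    return zlib.compress(recombined, 9), order, [len(b) for b in blocks]


def decompress_with_regrouping(compressed: bytes, order: list[int],
                               lengths: list[int]) -> bytes:
    """Undo the regrouping using the recorded order and block lengths."""
    recombined = zlib.decompress(compressed)
    restored = [b""] * len(lengths)
    position = 0
    for index in order:
        restored[index] = recombined[position:position + lengths[index]]
        position += lengths[index]
    return b"".join(restored)


rng = random.Random(3)
repeated = bytes(rng.getrandbits(8) for _ in range(2 * BLOCK))
filler = bytes(rng.getrandbits(8) for _ in range(8 * BLOCK))
original = repeated + filler + repeated   # copies too far apart for zlib's window
compressed, order, lengths = compress_with_regrouping(original)
print(len(zlib.compress(original, 9)), len(compressed))
```

In this example the two copies of the repeated data are separated by more than zlib's 32 KB window, so compressing the original directly cannot remove the duplication, while compressing the recombined data can.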
Referring to fig. 3, a data compression method according to an embodiment of the present invention is described in detail below based on the embodiment of fig. 2, where another embodiment of a data compression method according to an embodiment of the present invention includes:
301. Dividing original data into a plurality of data blocks;
To cluster and recombine similar data as intended by the present invention, the original data needs to be divided into a plurality of data blocks. Data blocking divides the original data into a plurality of data blocks using a blocking algorithm. The average block size is about 8 KB (it can be set to 4 KB or 16 KB as needed), and the blocking algorithm may be a content-based blocking algorithm or a fixed-length blocking algorithm.
Specifically, the content and the characteristics of the content blocking and fixed-length blocking algorithm are described in detail in step 201 in the embodiment of fig. 2, and are not described herein again.
302. Recording the sequence, the offset and the block length of a plurality of data blocks to generate an original file spectrum;
To make it possible to restore the compressed data to the original data later, the data compression system needs to record the sequence, offset and block length of the data blocks in the original data. The sequence is used to restore the order of the data blocks in the original data, and the offset and block length are used to read the content of each data block accurately. A file in which the sequence, offset and block length of the plurality of data blocks are recorded is referred to as an original file spectrum.
FIG. 4 is a diagram of the organization of a file spectrum, showing an example original file spectrum for a file named TEST. The file spectrum mainly comprises file information (the file size, the file name length and the file name) and data block metadata (the number of data blocks, and the offset and block length of each data block).
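One possible in-memory and on-disk representation of such a file spectrum is sketched below. The field list follows the description above, but the byte layout (little-endian, field widths) and the names FileSpectrum, pack_spectrum and unpack_spectrum are assumptions made for illustration; the patent does not specify an encoding.

```python
import struct
from typing import NamedTuple


class FileSpectrum(NamedTuple):
    file_size: int
    file_name: str
    blocks: list[tuple[int, int]]  # (offset, block length), in original block order


def pack_spectrum(spec: FileSpectrum) -> bytes:
    """Serialize: file size, name length, name, block count, then one record per block."""
    name = spec.file_name.encode("utf-8")
    out = struct.pack("<QH", spec.file_size, len(name)) + name
    out += struct.pack("<I", len(spec.blocks))
    for offset, length in spec.blocks:
        out += struct.pack("<QI", offset, length)
    return out


def unpack_spectrum(raw: bytes) -> FileSpectrum:
    """Inverse of pack_spectrum."""
    file_size, name_len = struct.unpack_from("<QH", raw, 0)
    pos = struct.calcsize("<QH")
    file_name = raw[pos:pos + name_len].decode("utf-8")
    pos += name_len
    (count,) = struct.unpack_from("<I", raw, pos)
    pos += struct.calcsize("<I")
    blocks = []
    for _ in range(count):
        blocks.append(struct.unpack_from("<QI", raw, pos))
        pos += struct.calcsize("<QI")
    return FileSpectrum(file_size, file_name, blocks)


# Hypothetical example: a 6 KB file named TEST with blocks of 1 KB, 2 KB and 3 KB.
spectrum = FileSpectrum(6144, "TEST", [(0, 1024), (1024, 2048), (3072, 3072)])
```

The position of each record in the block list encodes the sequence of the data blocks, so no separate sequence number is stored in this sketch.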
303. Detecting similarity of a plurality of data blocks;
After the original data has been divided, a plurality of data blocks is obtained, and the data compression system performs similarity detection on these data blocks. Various similarity detection algorithms can be used, for example the super-feature-value method, Simhash or Minhash.
For example, assuming there are N data blocks, applying one hash algorithm to each of the N data blocks yields N hash values, that is, N super feature values. To improve the recognition rate of similarity between data blocks, however, several hash algorithms are applied to each data block, so that each data block corresponds to several super feature values and to several super-feature-value index entries. Each super feature value of each data block is compared with the super feature values of the other data blocks, and if any two data blocks are found to share a super feature value, the two data blocks are regarded as similar data blocks.
It should be noted that, in this embodiment, the global super-feature value index is used to detect the similarity of multiple data blocks, so that the range of similarity detection is expanded, and the detection effect of similar data blocks is improved. However, in this embodiment, the similar detection algorithm may also adopt a Simhash or Minhash detection algorithm, and a specific detection algorithm is not limited herein.
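The idea of giving each data block several feature values and treating blocks as similar when any value matches can be sketched as follows. The patent does not specify how a super feature value is computed; this sketch assumes a min-hash style feature (the smallest salted hash over all fixed-width windows of the block), which is one common way to make features tolerant of small edits. The names and parameters are hypothetical.

```python
import hashlib
import random


def super_features(block: bytes, count: int = 3, width: int = 16) -> list[int]:
    """Return `count` features; feature k is the minimum k-salted hash over all windows."""
    features = []
    for k in range(count):
        salt = bytes([k])  # a different salt plays the role of a different hash algorithm
        best = min(
            hashlib.blake2b(salt + block[i:i + width], digest_size=8).digest()
            for i in range(max(1, len(block) - width + 1))
        )
        features.append(int.from_bytes(best, "big"))
    return features


def is_similar(a: list[int], b: list[int]) -> bool:
    """Two blocks are regarded as similar if any feature matches at the same position."""
    return any(x == y for x, y in zip(a, b))


rng = random.Random(7)
base = bytes(rng.getrandbits(8) for _ in range(2048))
edited = base[:1000] + b"\x00" + base[1001:]              # one byte replaced
unrelated = bytes(rng.getrandbits(8) for _ in range(2048))
print(is_similar(super_features(base), super_features(edited)))
```

Because a single edit touches only a few windows, the minimum over all windows usually stays the same, so near-identical blocks tend to share at least one feature while unrelated blocks do not.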
304. Sequentially migrating and recombining the similar data blocks to generate a plurality of similar linked lists;
In step 303, if the data compression system finds data blocks with the same super feature value, those data blocks are added in sequence to the corresponding similar linked list, and the offset and block length of each data block are recorded in that list, so that a plurality of similar linked lists is obtained.
As shown in FIG. 1, if data blocks A, C and F are similar, they are recorded in similar linked list 1; if data blocks B, D and E are similar, they are recorded in similar linked list 2. If a data block shares no super feature value with any other data block, a new similar linked list is created to store that data block.
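Building the similar linked lists from a global feature index can be sketched as below. This is a greedy simplification under stated assumptions: ordinary Python lists stand in for linked lists, a block joins the first list in which one of its feature values has already been seen, lists are never merged afterwards, and the feature values in the example are made up to reproduce the grouping of FIG. 1.

```python
def build_similar_lists(features: list[list[int]]) -> list[list[int]]:
    """Group block indices; blocks sharing a feature value end up in the same list."""
    index: dict[tuple[int, int], int] = {}  # (hash function number, value) -> list id
    lists: list[list[int]] = []
    for block_id, feats in enumerate(features):
        target = None
        for k, value in enumerate(feats):
            if (k, value) in index:
                target = index[(k, value)]
                break
        if target is None:
            # No shared feature value: start a new similar list for this block.
            target = len(lists)
            lists.append([])
        lists[target].append(block_id)
        for k, value in enumerate(feats):
            index.setdefault((k, value), target)
    return lists


# Made-up feature values for blocks A..F (indices 0..5), two features per block.
toy_features = [[1, 10], [2, 20], [1, 11], [3, 20], [2, 21], [4, 11]]
similar_lists = build_similar_lists(toy_features)
print(similar_lists)  # → [[0, 2, 5], [1, 3, 4]], i.e. A, C, F and B, D, E
```

A fuller version would also store each block's offset and block length in the list entries, as the text describes.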
305. Reading corresponding data block contents from the original data according to the plurality of similar linked lists to generate recombined data;
The content of each data block is read from the original data according to the sequence, offset and block length of the data blocks in each similar linked list, and each read data block is written to a file in sequence to generate the recombined data.
As shown in FIG. 1, the contents of data blocks A, C and F are read from the original data in turn according to the sequence, offset and block length recorded in similar linked list 1 and written to the file in sequence; the contents of data blocks B, D and E are then read from the original data according to the sequence, offset and block length recorded in similar linked list 2 and written to the file in sequence. In the same way, the content of each data block is read and written list by list, which generates the recombined data, such as the recombined data A, C, F, B, D, E shown in FIG. 1.
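The read-and-write loop of step 305 can be sketched as follows, assuming the similar linked lists are given as lists of block indices and the block table holds (offset, block length) pairs. The one-byte blocks in the example are a deliberately tiny stand-in for the blocks A to F of FIG. 1, and the function name is hypothetical.

```python
def reorganize(data: bytes, blocks: list[tuple[int, int]],
               similar_lists: list[list[int]]) -> bytes:
    """Concatenate block contents list by list, in the order recorded in each list."""
    out = bytearray()
    for similar in similar_lists:
        for block_id in similar:
            offset, length = blocks[block_id]
            out += data[offset:offset + length]
    return bytes(out)


original = b"ABCDEF"                          # six one-byte blocks A..F
block_table = [(i, 1) for i in range(6)]      # (offset, block length) per block
recombined = reorganize(original, block_table, [[0, 2, 5], [1, 3, 4]])
print(recombined)  # → b'ACFBDE'
```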
306. Updating the offsets of the data blocks according to the reorganized data to obtain a new original file spectrum;
After the recombined data is generated, the position of each data block has changed, so the offset of each data block has changed as well. As shown in FIG. 1, assume that in the original data block A is 1K, block B is 2K and block C is 3K. The offset of block C in the original data is then the sum of the block lengths of A and B, that is, 3K. After the recombined data is generated, block C immediately follows block A, so its offset is the block length of A, that is, 1K. To be able to recover the original data later from the file spectrum and the recombined data, the data compression system needs to update the offsets of the data blocks in the original file spectrum according to the recombined data. For convenience of description, the updated original file spectrum is referred to as the new original file spectrum.
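The offset update can be worked through numerically. In the sketch below the lengths of blocks A, B and C come from the example in the text, while the lengths of D, E and F are assumed purely for illustration; the helper name offsets_for is hypothetical.

```python
KB = 1024

# A, B and C use the sizes from the text; D, E and F are assumed for illustration.
lengths = {"A": 1 * KB, "B": 2 * KB, "C": 3 * KB, "D": 1 * KB, "E": 1 * KB, "F": 2 * KB}
original_order = ["A", "B", "C", "D", "E", "F"]
recombined_order = ["A", "C", "F", "B", "D", "E"]


def offsets_for(order: list[str], sizes: dict[str, int]) -> dict[str, int]:
    """Offset of each block = sum of the lengths of the blocks placed before it."""
    offsets, position = {}, 0
    for name in order:
        offsets[name] = position
        position += sizes[name]
    return offsets


old = offsets_for(original_order, lengths)
new = offsets_for(recombined_order, lengths)

# New original file spectrum: still listed in original order, but with new offsets.
new_spectrum = [(name, new[name], lengths[name]) for name in original_order]
print(old["C"] // KB, new["C"] // KB)  # → 3 1
```

Keeping the entries in original order while storing offsets into the recombined data is what allows decompression to restore the original sequence.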
307. Compressing the new original file spectrum to obtain a compressed file spectrum;
After the new original file spectrum is obtained, it is compressed to obtain a compressed file spectrum. To facilitate later decompression, the compressed file spectrum is stored in association with the compressed data generated in the next step.
308. Compressing the recombined data to obtain compressed data;
After the recombined data is obtained in step 305, because similar data blocks have been clustered together, compression can eliminate the redundancy among similar data blocks as far as possible and yields compressed data of smaller size.
Further, the invention also solves the problem that traditional compression methods, limited by the size of the sliding window, cannot eliminate redundancy between copies that are too far apart.
It should be noted that step 308 may also be executed before step 307, that is, there is no order restriction between step 307 and step 308, and in practice, for convenience of operation, step 307 may also be executed in combination with step 308, that is, the new original file spectrum and the reorganized data are compressed at the same time to obtain compressed data and a compressed file spectrum.
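Steps 307 and 308 can be sketched together as follows. The patent only requires a traditional compression method and an association between the two compressed parts; this sketch assumes zlib as the compressor and a hypothetical container in which a 4-byte length prefix keeps the compressed file spectrum next to the compressed data.

```python
import struct
import zlib


def pack(recombined: bytes, spectrum: bytes) -> bytes:
    """Compress both parts and store them together: [spectrum length][spectrum][data]."""
    compressed_spectrum = zlib.compress(spectrum, 9)
    compressed_data = zlib.compress(recombined, 9)
    return struct.pack("<I", len(compressed_spectrum)) + compressed_spectrum + compressed_data


def unpack(blob: bytes) -> tuple[bytes, bytes]:
    """Return (recombined data, new original file spectrum) from a packed blob."""
    (spectrum_len,) = struct.unpack_from("<I", blob, 0)
    spectrum = zlib.decompress(blob[4:4 + spectrum_len])
    recombined = zlib.decompress(blob[4 + spectrum_len:])
    return recombined, spectrum


blob = pack(b"ACFBDE" * 1000, b"serialized file spectrum")
```

Because the two parts are compressed independently, the order in which they are produced does not matter, which matches the remark that steps 307 and 308 may be swapped or combined.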
According to the method, the original data are divided into a plurality of data blocks, the similarity of the data blocks is detected, the similar data blocks are migrated and recombined to generate recombined data, and then the recombined data is compressed to obtain compressed data.
The data compression method of the present invention is described above, and the data decompression method of the present invention is described below. Referring to FIG. 6, an embodiment of the data decompression method in an embodiment of the present invention includes:
601. Decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
Based on the embodiment of FIG. 3, to recover the original data after the compressed data and the compressed file spectrum have been obtained, the data decompression system needs to decompress both, which yields the recombined data and the new original file spectrum. FIG. 5 is a process diagram of the data decompression method.
As shown in fig. 5, after the compressed data and the compressed file spectrum are decompressed, the restructured data and the new original file spectrum are obtained.
602. Respectively reading a plurality of data blocks from the recombined data according to the sequence, the offset and the block length of the plurality of data blocks recorded by the new original file spectrum;
Decompressing the compressed data and the compressed file spectrum yields the recombined data and the new original file spectrum. The new original file spectrum records the sequence and block length of each data block in the original data and the offset of each data block in the recombined data. The data decompression system therefore reads the content of each data block from the recombined data, in the order in which the blocks appear in the original data, using the recorded block lengths and the offsets in the recombined data.
As shown in FIG. 5, based on the order A, B, C, D, E, F of the data blocks recorded in the new original file spectrum, together with the offset and block length of each data block in the recombined data, the content of each data block is read from the recombined data A, D, F, B, C, E in the order in which the blocks appear in the original data.
603. And sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
In step 602, the data decompression system sequentially reads the contents of the data blocks from the reconstructed data according to the sequence of the data blocks recorded in the original data, and then sequentially writes the contents of the data blocks, thereby recovering the original data.
It should be noted that, if the data is stored on a disk, data on the disk is written sequentially, whereas reading the content of each data block from the recombined data in original-data order is a non-sequential read. This incurs some disk I/O overhead and shortens the service life of the disk.
According to the data decompression method, the compressed data and the compressed file spectrum are decompressed to obtain the recombined data and the new original file spectrum respectively, a plurality of data blocks is read from the recombined data according to the sequence, offset and block length recorded in the new original file spectrum, and the data blocks are then written in sequence, so that the original data is recovered.
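Steps 602 and 603 can be sketched as follows, assuming the new original file spectrum has already been decoded into a list of (offset, block length) pairs that is ordered as in the original data and whose offsets point into the recombined data. The example uses one-byte blocks and the recombined order A, C, F, B, D, E of FIG. 1; the function name is hypothetical.

```python
def restore(recombined: bytes, spectrum: list[tuple[int, int]]) -> bytes:
    """Read each block at its offset in the recombined data and write blocks in original order."""
    out = bytearray()
    for offset, length in spectrum:
        out += recombined[offset:offset + length]
    return bytes(out)


recombined = b"ACFBDE"
# Entries are in original order A, B, C, D, E, F; offsets refer to the recombined data.
new_spectrum = [(0, 1), (3, 1), (1, 1), (4, 1), (5, 1), (2, 1)]
print(restore(recombined, new_spectrum))  # → b'ABCDEF'
```

The reads jump around in the recombined data while the writes are sequential, which is the access pattern behind the disk I/O remark above.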
The data compression method of the present invention is described above, and the data compression system of the present invention is described below. Referring to FIG. 7, an embodiment of the data compression system in an embodiment of the present invention includes:
a block unit 701, configured to divide original data into a plurality of data blocks;
a detecting unit 702, configured to detect similarities of a plurality of data blocks;
a reorganizing unit 703, configured to sequentially migrate and reorganize the similar data blocks to generate reorganized data;
a compressing unit 704, configured to compress the reassembled data to generate compressed data.
It should be noted that the functions of the units in this embodiment are the same as the functions of the data compression system described in the embodiment of fig. 2, and are not described herein again.
According to the invention, the block unit 701 divides the original data into a plurality of data blocks, the detecting unit 702 detects the similarity of the data blocks, the reorganizing unit 703 migrates and recombines the similar data blocks to generate recombined data, and the compressing unit 704 then compresses the recombined data to obtain compressed data.
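The four units can be pictured as four small functions chained together. The sketch below is illustrative only and rests on several assumptions that are not in the source: fixed-size 8-byte blocks, a toy similarity rule (blocks sharing a 4-byte prefix count as similar), and zlib as the final compressor.

```python
import zlib

BLOCK_SIZE = 8  # assumed fixed block size; the source does not prescribe one


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Block unit: divide the original data into data blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def similarity_key(block: bytes) -> bytes:
    """Detecting unit (toy stand-in): blocks with the same 4-byte prefix are 'similar'."""
    return block[:4]


def recombine(blocks: list) -> bytes:
    """Reorganizing unit: place similar blocks next to each other."""
    groups = {}  # dicts keep insertion order, so groups follow first appearance
    for block in blocks:
        groups.setdefault(similarity_key(block), []).append(block)
    return b"".join(b for group in groups.values() for b in group)


def compress(data: bytes) -> bytes:
    """Compressing unit: compress the recombined data (zlib is an assumption)."""
    return zlib.compress(recombine(split_blocks(data)))


original = b"aaaa0001bbbb0001aaaa0002bbbb0002"
print(recombine(split_blocks(original)))  # b'aaaa0001aaaa0002bbbb0001bbbb0002'
```

Because the block order changes, the compressed output alone is not enough to restore the original data; the ordering information has to be kept separately, which is the role of the file spectrum in the later embodiments.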
For ease of understanding, the data compression system in the embodiment of the present invention is described in detail below, and referring to fig. 8, another embodiment of the data compression system in the embodiment of the present invention includes:
a block unit 801 configured to divide original data into a plurality of data blocks;
a detecting unit 802, configured to detect similarities of multiple data blocks;
a restructuring unit 803, configured to migrate and restructure similar data blocks in sequence to generate restructured data;
a first compressing unit 804, configured to compress the reassembled data to generate compressed data.
Further, the data compression system further comprises:
a first generating unit 805, configured to record the sequence, offset, and block length of a plurality of data blocks, and generate an original file spectrum;
an updating unit 806, configured to update offsets of the multiple data blocks according to the reorganized data to obtain a new original file spectrum;
a second compressing unit 807 for compressing the new original file spectrum to generate a compressed file spectrum.
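A rough sketch of how the original file spectrum might be generated and then updated is given below. The dictionary layout and the function names `build_spectrum` and `update_spectrum` are assumptions made for illustration; the source only states that the sequence, offset and block length are recorded and that the offsets are updated according to the recombined data.

```python
def build_spectrum(block_lengths):
    """Record the sequence, offset and block length of each block in the original data."""
    spectrum, offset = [], 0
    for seq, length in enumerate(block_lengths):
        spectrum.append({"seq": seq, "offset": offset, "length": length})
        offset += length
    return spectrum


def update_spectrum(spectrum, new_order):
    """Replace each offset with the block's position inside the recombined data.

    new_order lists block sequence numbers in the order in which the blocks
    appear in the recombined data.
    """
    by_seq = {entry["seq"]: entry for entry in spectrum}
    new_offsets, offset = {}, 0
    for seq in new_order:
        new_offsets[seq] = offset
        offset += by_seq[seq]["length"]
    # Entries stay in original-data order; only the offset field changes.
    return [{**entry, "offset": new_offsets[entry["seq"]]} for entry in spectrum]


# Blocks A..F with assumed lengths, recombined in the order A, D, F, B, C, E.
original_spectrum = build_spectrum([4, 6, 5, 4, 5, 6])
new_spectrum = update_spectrum(original_spectrum, [0, 3, 5, 1, 2, 4])
print([e["offset"] for e in original_spectrum])  # [0, 4, 10, 15, 19, 24]
print([e["offset"] for e in new_spectrum])       # [0, 14, 20, 4, 25, 8]
```

Keeping the entries in original order means the decompression side can iterate over the spectrum directly and needs no separate sort step.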
Wherein the recombination unit 803 includes:
a first generating module 8031, configured to migrate and recombine the similar data blocks in sequence to generate a plurality of similar linked lists;
the second generating module 8032 is configured to read, according to the multiple similar linked lists, content of the corresponding data block from the original data, and generate the reassembled data.
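The two modules can be illustrated with block indices rather than block contents. In this sketch each "similar linked list" is modelled as a plain Python list of block indices, and the group labels `g1` and `g2` are invented for the example; how blocks are actually judged similar is left to the detecting unit.

```python
def build_similar_lists(group_labels):
    """First generating module: one list of block indices per group of similar blocks."""
    lists = {}
    for index, label in enumerate(group_labels):
        lists.setdefault(label, []).append(index)
    return list(lists.values())


def generate_recombined(original: bytes, extents, similar_lists) -> bytes:
    """Second generating module: read block contents from the original data, list by list.

    extents[i] is the (offset, length) of block i in the original data.
    """
    out = bytearray()
    for chain in similar_lists:
        for index in chain:
            offset, length = extents[index]
            out += original[offset:offset + length]
    return bytes(out)


original = b"AABBCCDDEEFF"                 # blocks A..F, 2 bytes each (assumed)
extents = [(i * 2, 2) for i in range(6)]
labels = ["g1", "g2", "g2", "g1", "g2", "g1"]  # assumed: A, D, F similar; B, C, E similar

similar_lists = build_similar_lists(labels)
print(similar_lists)                                        # [[0, 3, 5], [1, 2, 4]]
print(generate_recombined(original, extents, similar_lists))  # b'AADDFFBBCCEE'
```

Working on indices first means block contents are copied only once, when the recombined data is finally generated.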
The detecting unit 802 includes:
the detecting module 8021 is configured to detect similarities of the multiple data blocks by using a super-eigenvalue method, a Simhash method, or a Minhash method.
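Of the three methods named, Simhash is the simplest to sketch. The version below is a generic textbook-style Simhash over byte shingles, not the detection procedure of this patent; the shingle width, the BLAKE2b hash, the 64-bit fingerprint and the distance threshold are all assumptions.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8-byte digest requested below


def simhash(block: bytes, shingle: int = 4) -> int:
    """Compute a 64-bit Simhash fingerprint from overlapping byte shingles."""
    weights = [0] * BITS
    for i in range(max(1, len(block) - shingle + 1)):
        digest = hashlib.blake2b(block[i:i + shingle], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(BITS):
            # Each shingle votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_similar(a: bytes, b: bytes, threshold: int = 8) -> bool:
    """Treat two blocks as similar when their fingerprints differ in few bits."""
    return hamming(simhash(a), simhash(b)) <= threshold


block = b"the quick brown fox jumps over the lazy dog"
print(is_similar(block, block))  # True: identical blocks have identical fingerprints
```

Blocks that share most of their shingles tend to produce fingerprints with a small Hamming distance, so the threshold trades missed matches against false matches and would need tuning on real data.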
It should be noted that the functions of the units and the modules are similar to those of the data compression system in the embodiment of fig. 3, and are not described again here.
According to the invention, the block unit 801 divides the original data into a plurality of data blocks, the detecting unit 802 detects the similarity of the data blocks, the restructuring unit 803 migrates and recombines the similar data blocks to generate recombined data, and the first compressing unit 804 then compresses the recombined data to obtain compressed data.
The data compression system in the embodiments of the present invention is described above. With reference to Fig. 9, an embodiment of a data decompression system in the embodiments of the present invention includes:
a decompression unit 901, configured to decompress the compressed data and the compressed file spectrum to obtain the recombined data and the new original file spectrum respectively;
a reading unit 902, configured to read a plurality of data blocks from the recombined data according to the offsets and block lengths of the plurality of data blocks recorded in the new original file spectrum;
a writing unit 903, configured to sequentially write the plurality of data blocks according to the sequence of the plurality of data blocks recorded in the new original file spectrum to obtain the original data.
It should be noted that the functions of the units in this embodiment are similar to the functions of the data decompression system in the embodiment of fig. 6, and are not described here again.
According to the invention, the decompression unit 901 decompresses the compressed data and the compressed file spectrum to obtain the recombined data and the new original file spectrum respectively, the reading unit 902 reads a plurality of data blocks from the recombined data according to the sequence, offset and block length of the plurality of data blocks recorded in the new original file spectrum, and the writing unit 903 then sequentially writes the plurality of data blocks, thereby recovering the original data.
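A compact round trip covering both compressed streams might look like the following. The choices here are assumptions rather than anything stated in the source: the file spectrum is serialized as JSON, both streams are compressed with zlib, and the spectrum is reduced to a list of `[offset, length]` pairs kept in original-data order.

```python
import json
import zlib


def pack(recombined: bytes, spectrum: list):
    """Compression side: produce the compressed data and the compressed file spectrum."""
    compressed_data = zlib.compress(recombined)
    compressed_spectrum = zlib.compress(json.dumps(spectrum).encode("utf-8"))
    return compressed_data, compressed_spectrum


def unpack(compressed_data: bytes, compressed_spectrum: bytes) -> bytes:
    """Decompression side, following the roles of units 901, 902 and 903."""
    # Decompression unit: recover the recombined data and the new original file spectrum.
    recombined = zlib.decompress(compressed_data)
    spectrum = json.loads(zlib.decompress(compressed_spectrum))
    # Reading unit: fetch each block by offset and block length, in original order.
    blocks = [recombined[offset:offset + length] for offset, length in spectrum]
    # Writing unit: write the blocks one after another to rebuild the original data.
    return b"".join(blocks)


# Recombined layout A, D, F, B, C, E; spectrum entries are in original order A..F.
spectrum = [[0, 2], [6, 2], [8, 2], [2, 2], [10, 2], [4, 2]]
data, spec = pack(b"AADDFFBBCCEE", spectrum)
print(unpack(data, spec))  # b'AABBCCDDEEFF'
```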
The data compression system and the data decompression system in the embodiment of the present invention are described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:
The computer device is used to implement the functions of the data compression system side, and an embodiment of the computer device in the embodiments of the present invention includes:
a processor and a memory;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
dividing original data into a plurality of data blocks;
detecting similarity of a plurality of data blocks;
sequentially migrating and recombining the similar data blocks to generate recombined data;
and compressing the recombined data to generate compressed data.
In some embodiments of the present invention, the processor may be further configured to:
and recording the sequence, the offset and the block length of the plurality of data blocks to generate an original file spectrum.
In some embodiments of the present invention, the processor may be further configured to:
updating the offsets of the data blocks according to the reorganized data to obtain a new original file spectrum;
and compressing the new original file spectrum to generate a compressed file spectrum.
In some embodiments of the present invention, the processor may be further configured to:
sequentially migrating and recombining the similar data blocks to generate a plurality of similar linked lists;
and reading the corresponding data block content from the original data according to the plurality of similar linked lists to generate recombined data.
In some embodiments of the present invention, the processor may be further configured to:
and detecting the similarity of the data blocks by a super characteristic value method, a Simhash method or a Minhash method.
The computer device may also be used to implement the functions of the data decompression system side. In another embodiment of the computer device in the embodiments of the present invention, the processor implements the following steps when executing the computer program stored in the memory:
decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
respectively reading a plurality of data blocks from the recombined data according to the sequence, the offset and the block length of the plurality of data blocks recorded by the new original file spectrum;
and sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
It should be understood that, no matter on the data compression system side or the data decompression system side, when the processor in the computer device described above executes the computer program, the functions of the units in the corresponding device embodiments described above may also be implemented, and thus, no further description is provided herein. Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in a memory and executed by a processor to implement the present invention. One or more of the modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of a computer program in a data compression system/data decompression system. For example, a computer program may be divided into units in the data compression systems described above, and each unit may implement specific functions as described above for the corresponding data compression system.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the processor and the memory are merely examples of components of the computer device and do not limit it; the computer device may include more or fewer components, combine certain components, or use different components. For example, the computer device may further include input and output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The processor is the control center of the computer device and connects the various parts of the overall computer device through various interfaces and lines.
The memory may be used to store computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory, as well as by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The present invention also provides a computer-readable storage medium for implementing the functions of the data compression system side. A computer program is stored on the storage medium, and when the computer program is executed by a processor, the processor is operable to perform the steps of:
dividing original data into a plurality of data blocks;
detecting similarity of a plurality of data blocks;
sequentially migrating and recombining the similar data blocks to generate recombined data;
and compressing the recombined data to generate compressed data.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
recording the sequence, the offset and the block length of the plurality of data blocks to generate an original file spectrum;
updating the offsets of the data blocks according to the recombined data to obtain a new original file spectrum;
compressing the new original file spectrum to generate a compressed file spectrum;
sequentially migrating and recombining the similar data blocks to generate a plurality of similar linked lists;
reading the corresponding data block content from the original data according to the plurality of similar linked lists to generate recombined data;
and detecting the similarity of the data blocks by a super characteristic value method, a Simhash method or a Minhash method.
The present invention also provides another computer-readable storage medium for implementing the functions of the data decompression system side. A computer program is stored on the storage medium, and when the computer program is executed by a processor, the processor is operable to perform the steps of:
decompressing the compressed data and the compressed file spectrum to respectively obtain recombined data and a new original file spectrum;
respectively reading a plurality of data blocks from the recombined data according to the sequence, the offset and the block length of the plurality of data blocks recorded by the new original file spectrum;
and sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
It will be appreciated that the integrated units, if implemented as software functional units and sold or used as separate products, may be stored in a corresponding computer-readable storage medium. Based on such understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and used to instruct related hardware to implement the steps of the above embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (11)
1. A method of data compression, comprising:
dividing original data into a plurality of data blocks;
detecting a similarity of the plurality of data blocks;
sequentially migrating and recombining the similar data blocks to generate recombined data;
compressing the recombined data by adopting a cache technology based on a sliding window to generate compressed data;
the sequentially migrating and recombining the similar data blocks to generate recombined data comprises the following steps:
sequentially migrating and recombining the similar data blocks to generate a plurality of similar linked lists;
and reading corresponding data block contents from the original data according to the plurality of similar linked lists to generate recombined data.
2. The method of claim 1, wherein after the dividing the raw data into a plurality of data blocks and before the detecting the similarity of the plurality of data blocks, the method further comprises:
and recording the sequence, the offset and the block length of the plurality of data blocks to generate an original file spectrum.
3. The method according to claim 2, wherein after the similar data blocks are migrated and reorganized in sequence to generate reorganized data, and before the reorganized data is compressed to generate compressed data, the method further comprises:
and updating the offsets of the plurality of data blocks according to the reorganization data to obtain a new original file spectrum.
4. The method according to any one of claims 1 to 3, wherein the detecting the similarity of the plurality of data blocks comprises:
and detecting the similarity of the data blocks by a super characteristic value method, a Simhash method or a Minhash method.
5. A method of data decompression, comprising:
decompressing compressed data and a compressed file spectrum to obtain reconstructed data and a new original file spectrum respectively, wherein the reconstructed data is obtained by dividing original data into a plurality of data blocks, sequentially migrating and reconstructing similar data blocks in the plurality of data blocks to generate a plurality of similar linked lists, and reading corresponding data block contents from the original data according to the plurality of similar linked lists;
reading the data blocks from the recombined data according to the sequence, the offset and the block length of the data blocks recorded by the new original file spectrum;
and sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
6. A data compression system, comprising:
a block unit for dividing original data into a plurality of data blocks;
a detecting unit configured to detect similarities of the plurality of data blocks;
the recombination unit is used for sequentially migrating and recombining the similar data blocks to generate recombined data;
the compression unit is used for compressing the recombined data by adopting a cache technology based on a sliding window to generate compressed data;
the recombination unit comprises:
the first generation module is used for sequentially migrating and recombining the similar data blocks to generate a plurality of similar linked lists;
and the second generation module is used for reading corresponding data block contents from the original data according to the plurality of similar linked lists and generating the recombined data.
7. A data decompression system, comprising:
a decompression unit, configured to decompress compressed data and a compressed file spectrum to obtain recombined data and a new original file spectrum respectively, wherein the recombined data is obtained by dividing original data into a plurality of data blocks, sequentially migrating and recombining similar data blocks in the plurality of data blocks to generate a plurality of similar linked lists, and reading corresponding data block contents from the original data according to the plurality of similar linked lists;
a reading unit, configured to read the multiple data blocks from the reorganized data according to the sequence, offset, and block length of the multiple data blocks recorded by the new original file spectrum;
and the writing unit is used for sequentially writing the data blocks according to the sequence of the data blocks recorded by the new original file spectrum to obtain original data.
8. A computer arrangement comprising a processor for carrying out the steps of the data compression method according to any one of claims 1 to 4 when executing a computer program stored on a memory.
9. A computer arrangement comprising a processor for carrying out the steps of the data decompression method according to claim 5 when executing a computer program stored on a memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the data compression method according to any one of claims 1 to 4.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, is adapted to carry out the steps of the data decompression method according to claim 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710884914.2A CN107682016B (en) | 2017-09-26 | 2017-09-26 | Data compression method, data decompression method and related system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107682016A (en) | 2018-02-09 |
CN107682016B (en) | 2021-09-17 |
Family
ID=61137381
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710884914.2A Active CN107682016B (en) | 2017-09-26 | 2017-09-26 | Data compression method, data decompression method and related system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107682016B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108427538B (en) * | 2018-03-15 | 2021-06-04 | 深信服科技股份有限公司 | Storage data compression method and device of full flash memory array and readable storage medium |
CN110083743B (en) * | 2019-03-28 | 2021-11-16 | 哈尔滨工业大学(深圳) | Rapid similar data detection method based on unified sampling |
CN112099725A (en) | 2019-06-17 | 2020-12-18 | 华为技术有限公司 | Data processing method and device and computer readable storage medium |
CN110781155B (en) * | 2019-10-18 | 2022-06-24 | 赛尔网络有限公司 | Data storage reading method, system, equipment and medium based on IPFS |
CN110888918A (en) * | 2019-11-25 | 2020-03-17 | 湖北工业大学 | Similar data detection method and device, computer equipment and storage medium |
CN111984615B (en) | 2020-08-04 | 2024-05-28 | 中国人民银行数字货币研究所 | File sharing method, device and system |
CN112665886B (en) * | 2020-12-11 | 2023-06-27 | 浙江中控技术股份有限公司 | Data conversion method for measuring high-frequency original data by vibration of large rotary machine |
CN115145884A (en) * | 2021-03-30 | 2022-10-04 | 华为技术有限公司 | Data compression method and device |
CN115858478B (en) * | 2023-02-24 | 2023-05-12 | 山东中联翰元教育科技有限公司 | Data rapid compression method of interactive intelligent teaching platform |
CN118337221B (en) * | 2024-06-13 | 2024-09-03 | 陕西颐刚盛讯科技有限责任公司 | Network security data transmission method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101667843A (en) * | 2009-09-22 | 2010-03-10 | 中兴通讯股份有限公司 | Methods and devices for compressing and uncompressing data of embedded system |
CN102065098A (en) * | 2010-12-31 | 2011-05-18 | 网宿科技股份有限公司 | Method and system for synchronizing data among network nodes |
CN103020317A (en) * | 2013-01-10 | 2013-04-03 | 曙光信息产业(北京)有限公司 | Device and method for data compression based on data deduplication |
CN103067022A (en) * | 2012-12-19 | 2013-04-24 | 中国石油天然气集团公司 | Nondestructive compressing method, uncompressing method, compressing device and uncompressing device for integer data |
CN107087184A (en) * | 2017-04-28 | 2017-08-22 | 华南理工大学 | A kind of multi-medium data recompression method |
US9767154B1 (en) * | 2013-09-26 | 2017-09-19 | EMC IP Holding Company LLC | System and method for improving data compression of a storage system in an online manner |
CN107251438A (en) * | 2015-02-16 | 2017-10-13 | 三菱电机株式会社 | Data compression device, data decompression device, data compression method, uncompressing data and program |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737132A (en) * | 2012-06-25 | 2012-10-17 | 天津神舟通用数据技术有限公司 | Multi-rule combined compression method based on database row and column mixed storage |
CN104142924A (en) * | 2013-05-06 | 2014-11-12 | 中国移动通信集团福建有限公司 | Method and device for compressing flash picture format |
CN104283567B (en) * | 2013-07-02 | 2018-07-03 | 北京四维图新科技股份有限公司 | A kind of compression of name data, decompression method and equipment |
CN105204781B (en) * | 2015-09-28 | 2019-04-12 | 华为技术有限公司 | Compression method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107682016A (en) | 2018-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107682016B (en) | Data compression method, data decompression method and related system | |
CN107506153B (en) | Data compression method, data decompression method and related system | |
CN107210753B (en) | Lossless reduction of data by deriving data from prime data units residing in a content association filter | |
CN108427538B (en) | Storage data compression method and device of full flash memory array and readable storage medium | |
CN109716658B (en) | Method and system for deleting repeated data based on similarity | |
CN110741637B (en) | Method for simplifying video data, computer readable storage medium and electronic device | |
CN112380196B (en) | Server for data compression transmission | |
CN111125033A (en) | Space recovery method and system based on full flash memory array | |
US20200294629A1 (en) | Gene sequencing data compression method and decompression method, system and computer-readable medium | |
CN108475508B (en) | Simplification of audio data and data stored in block processing storage system | |
Feng et al. | MLC: an efficient multi-level log compression method for cloud backup systems | |
US10817475B1 (en) | System and method for encoding-based deduplication | |
CN111124940B (en) | Space recovery method and system based on full flash memory array | |
CN116303297A (en) | File compression processing method, device, equipment and medium | |
WO2021082926A1 (en) | Data compression method and apparatus | |
CN111124939A (en) | Data compression method and system based on full flash memory array | |
CN111124259A (en) | Data compression method and system based on full flash memory array | |
Talasila et al. | Generalized deduplication: Lossless compression by clustering similar data | |
US11360954B2 (en) | System and method for hash-based entropy calculation | |
CN111198857A (en) | Data compression method and system based on full flash memory array | |
CN111625186B (en) | Data processing method, device, electronic equipment and storage medium | |
US10963437B2 (en) | System and method for data deduplication | |
US10990565B2 (en) | System and method for average entropy calculation | |
CN116601593A (en) | Data compression device, data storage device and method for data compression and data de-duplication | |
KR20210113297A (en) | Systems, methods, and apparatus for eliminating duplication and value redundancy in computer memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||