WO2017128763A1 - Data compression apparatus and method - Google Patents

Data compression apparatus and method

Info

Publication number
WO2017128763A1
Authority
WO
WIPO (PCT)
Prior art keywords
compressed
reference data
data
similarity
data blocks
Prior art date
Application number
PCT/CN2016/101494
Other languages
English (en)
French (fr)
Inventor
关坤
全绍晖
沈建强
王工艺
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2017128763A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the present invention relates to the field of data processing, and in particular, to a data compression apparatus and method.
  • data compression is a way to reduce the amount of duplicate data through specific steps to reduce storage space.
  • Differential (English: Delta) compression is a commonly used lossless data compression method.
  • the method includes the following steps: using a specific similarity detection algorithm to find, among a plurality of reference data blocks, the reference data block with the highest similarity to the data to be compressed, and performing a difference calculation between the data to be compressed and that reference data block to obtain the compression result.
  • the existing differential compression algorithm requires a high degree of similarity between the reference data block and the data to be compressed. When no reference data block is highly similar to the data to be compressed, the compression effect is poor.
  • the embodiment of the present invention provides a data compression apparatus and method.
  • the technical solution is as follows:
  • a data compression method includes: acquiring data to be compressed and m reference data blocks, where m is an integer greater than 1; and matching the data to be compressed with the m reference data blocks to obtain at least one index code.
  • each index code includes a reference data block identifier and string information; each reference data block identifier indicates one of the m reference data blocks, and each string information indicates the position, in that reference data block, of a continuous string of the data to be compressed.
  • the data compression method provided by the embodiment of the present invention compresses the data to be compressed by using multiple reference data blocks, and can ensure high compression efficiency even when no reference data block is highly similar to the data to be compressed.
  • compared with existing differential compression, the data compression method shown in the embodiment of the present invention places a lower requirement on the similarity between the reference data block and the data to be compressed, so a simple, general-purpose similarity matching algorithm suffices and reference data blocks that meet the requirement are easy to find. This improves compression efficiency while ensuring the compression effect.
  • when the m reference data blocks are acquired, the similarity between each preset reference data block and the data to be compressed is calculated, and the m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold are obtained.
  • obtaining the m reference data blocks whose similarity to the data to be compressed is greater than the preset threshold includes: each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is calculated, incrementing a match count by 1, the initial value of the match count being 0; determining whether the match count reaches a preset upper limit M, where M ≥ 2 and M is an integer; if the match count reaches the preset upper limit M, obtaining the calculated reference data blocks whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed; if the match count does not reach the preset upper limit M, determining whether the similarity between every reference data block and the data to be compressed has been calculated; and if all similarities have been calculated, likewise obtaining the calculated reference data blocks whose similarity is greater than the preset threshold as the m reference data blocks.
  • the data compression method provided by the embodiment of the present invention only needs to obtain several reference data blocks, with a relatively low similarity requirement, from the plurality of reference data blocks. Because the similarity requirement between the reference data block and the data to be compressed is low, the similarity between each reference data block and the data to be compressed can be calculated one by one, and the calculation can stop as soon as enough reference data blocks matching the data to be compressed have been found, which shortens the matching process and improves compression efficiency.
  • the method further includes: when the data to be compressed contains a continuous character string that does not correspond to any index code, generating an insertion code that contains that continuous character string, the insertion code indicating that the string is to be inserted at the time of decompression.
  • matching the data to be compressed with the m reference data blocks includes: connecting the m reference data blocks end to end to obtain a total reference data block; and matching the data to be compressed with the total reference data block.
  • an embodiment of the present invention provides a computing device, where the computing device includes a processor and a memory; the memory is coupled to the processor via a bus; the processor is configured to execute instructions stored in the memory; and by executing the instructions the processor implements the data compression method of the first aspect or any possible implementation of the first aspect.
  • an embodiment of the present invention provides a data compression apparatus, where the data compression apparatus includes at least one unit, and the at least one unit is configured to implement the data compression method of the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a block diagram showing the structure of a computing device shown in an exemplary embodiment of the present invention;
  • FIG. 2A is a flowchart of a data compression method according to an exemplary embodiment of the present invention;
  • FIG. 2B is a flowchart of a method for acquiring a reference data block according to the embodiment shown in FIG. 2A;
  • FIG. 2C is a flowchart of a multi-reference block-based differential compression method according to the embodiment shown in FIG. 2A;
  • FIG. 3 is a block diagram of a data compression apparatus provided by an exemplary embodiment of the present invention.
  • the computing device 100 can include a processor 110, a memory 130, and a bus 150.
  • the memory 130 is coupled to the processor 110 via a bus 150.
  • the processor 110 includes an arithmetic logic component, a register component, a control component, and so on. It may be an independent central processing unit, or an embedded processor such as a microprocessor (Micro Processor Unit, MPU), a microcontroller (Microcontroller Unit, MCU), or a digital signal processor (Embedded Digital Signal Processor, EDS).
  • the memory 130 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
  • the memory 130 can be used to store instructions that can be implemented as software programs or software modules.
  • the processor 110 may implement all or part of the steps of the data compression method in the embodiment shown in FIG. 2A below by executing the instructions stored in the memory 130.
  • the computing device 100 may further include components such as the communication component 120 and the cache 140.
  • Communication component 120 and cache 140 are coupled to processor 110 via bus 150, respectively.
  • the communication component 120 is for external communication, including communication with external networks or with other computing or storage devices. It can include multiple types of interfaces, such as an Ethernet interface or a wireless transceiver.
  • the cache 140 is used to cache some intermediate data in the processor 110 calculation process.
  • FIG. 2A is a flowchart of a data compression method according to an exemplary embodiment of the present invention; the method may be used in the computing device shown in FIG. 1.
  • the data compression method may include:
  • Step 201: Acquire data to be compressed and m reference data blocks, where m is an integer greater than 1.
  • the computing device may calculate the similarity between each preset reference data block and the data to be compressed, and obtain the m reference data blocks whose similarity to the data to be compressed is greater than the preset threshold.
  • a plurality of candidate reference data blocks are pre-stored in the computing device; when the data to be compressed is compressed, the computing device only needs to select several of these candidates as the reference data blocks to be used.
  • the computing device may calculate the similarity between each reference data block and the data to be compressed by using a relatively simple similarity calculation algorithm, and obtain the m reference data blocks whose similarity is greater than a preset threshold as the m reference data blocks that match the data to be compressed. Alternatively, the computing device may obtain the m reference data blocks with the largest similarity as the m reference data blocks that match the data to be compressed.
  • the similarity between each reference data block and the data to be compressed may be calculated one by one. Once enough reference data blocks matching the data to be compressed have been found, the subsequent calculation is stopped, thereby shortening the matching process and improving the compression efficiency.
  • FIG. 2B is a flowchart of a method for acquiring a reference data block involved in FIG. 2A. As shown in FIG. 2B, the method may include the following steps:
  • Step 201a Calculate the similarity between each reference data block and the data to be compressed.
  • N reference data blocks may be pre-stored, where N is greater than or equal to M, and steps 201a to 201e select, from the N reference data blocks, M reference data blocks with higher similarity to the data to be compressed.
  • the similarity may be calculated from Rabin fingerprints of the reference data block and of the data to be compressed. For example, a fixed number of Rabin fingerprints is selected for the reference data block and for the data to be compressed to form their respective feature subsets, each fingerprint in a feature subset being a hash value, and the similarity between the reference data block and the data to be compressed is determined by counting the matching Rabin fingerprints in the two feature subsets.
  • the Rabin fingerprints in the two feature subsets may be compared one by one until all comparisons end, or until the proportion of matched Rabin fingerprints in the feature subset reaches a preset threshold. For example, suppose the preset threshold is 20% and the feature subsets of the reference data block and of the data to be compressed each include five Rabin fingerprints. When calculating the similarity, the first Rabin fingerprints of the two feature subsets are compared first. If they match, the comparison ends and the reference data block is determined to match the data to be compressed. Otherwise, the second Rabin fingerprints of the two feature subsets are compared, and so on, until all comparisons are completed or a matching Rabin fingerprint is found.
  • to reduce the amount of comparison, the feature subset may be reduced to a super-feature subset, or the Rabin fingerprints in the feature subset may be reduced to super-fingerprints to form a feature subset with fewer fingerprints.
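  • The fingerprint comparison with early exit described above can be sketched as follows. This is only an illustration under stated assumptions: a simple polynomial rolling hash stands in for true Rabin fingerprints, the feature subset is taken as the k smallest fingerprint values (the text does not say how the fixed number of fingerprints is selected), and the names `rolling_fingerprints`, `feature_subset`, and `matches`, as well as the 4-byte window, are hypothetical.

```python
def rolling_fingerprints(data: bytes, window: int = 4,
                         base: int = 257, mod: int = (1 << 31) - 1) -> list:
    """Hash every `window`-byte substring with a polynomial rolling hash.

    Assumption: this is a stand-in for Rabin fingerprints, not the real thing.
    """
    if len(data) < window:
        return []
    high = pow(base, window - 1, mod)  # weight of the byte leaving the window
    h = 0
    for b in data[:window]:
        h = (h * base + b) % mod
    out = [h]
    for i in range(window, len(data)):
        h = ((h - data[i - window] * high) * base + data[i]) % mod
        out.append(h)
    return out


def feature_subset(data: bytes, k: int = 5) -> list:
    """Select a fixed number of fingerprints (assumed rule: the k smallest)."""
    return sorted(set(rolling_fingerprints(data)))[:k]


def matches(ref: bytes, target: bytes, k: int = 5, threshold: float = 0.2) -> bool:
    """Compare the two feature subsets pair by pair and stop as soon as the
    share of matched fingerprints reaches the threshold."""
    matched = 0
    for fa, fb in zip(feature_subset(ref, k), feature_subset(target, k)):
        if fa == fb:
            matched += 1
            if matched / k >= threshold:
                return True  # early exit: enough fingerprints already match
    return False
```

With five fingerprints and a 20% threshold, a single matching pair is enough, so `matches(b"ABCDEFGH", b"ABCDEFGH")` returns after the first comparison.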
  • step 201b, each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is calculated, the match count is incremented by one.
  • step 201c, it is determined whether the match count reaches the preset upper limit M. If yes, the process proceeds to step 201d; otherwise, the process proceeds to step 201e.
  • the initial value of the match count is 0, M ≥ 2, and M is an integer.
  • M is a preset fixed value which can be set by the developer or the user according to the actual compression scenario.
  • step 201d, the calculated reference data blocks whose similarity to the data to be compressed is greater than the preset threshold are obtained as the m reference data blocks that match the data to be compressed.
  • that is, each time a reference data block that matches the data to be compressed is found, the computing device determines whether enough matching reference data blocks have been found; if so, it obtains the matching reference data blocks found so far and stops the subsequent calculation.
  • step 201e, it is determined whether the similarity between every reference data block and the data to be compressed has been calculated. If yes, the process proceeds to step 201d; otherwise, the process returns to step 201a.
  • if the similarities between the reference data blocks and the data to be compressed have not all been calculated, the similarity between the next reference data block and the data to be compressed is calculated.
  • in other words, each time a reference data block matching the data to be compressed is found and the number of matching reference data blocks is still insufficient, the calculation continues until the number is sufficient or all reference data blocks have been examined. The matching reference data blocks found so far are then obtained and the subsequent calculation stops.
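  • The loop formed by steps 201a to 201e can be sketched as below. This is a minimal sketch, not the patent's implementation: the similarity function is supplied by the caller, and `shared_byte_ratio` is a toy measure invented here purely so that the example runs.

```python
def select_reference_blocks(target: bytes, candidates: list, similarity,
                            threshold: float, upper_limit: int) -> list:
    """Return the indices of the matching reference blocks, stopping early
    once `upper_limit` (M in the text) matches have been found."""
    matched = []  # len(matched) plays the role of the match count, initially 0
    for index, block in enumerate(candidates):
        if similarity(block, target) > threshold:  # steps 201a and 201b
            matched.append(index)
            if len(matched) >= upper_limit:        # step 201c leads to 201d
                break                              # skip the remaining blocks
    # Step 201d; also reached through step 201e when every block was examined.
    return matched


def shared_byte_ratio(block: bytes, target: bytes) -> float:
    """Toy similarity (an assumption): share of the target's distinct bytes
    that also occur in the reference block."""
    distinct = set(target)
    return len(distinct & set(block)) / max(1, len(distinct))


candidates = [b"xyz", b"abc", b"abd", b"abe"]
selected = select_reference_blocks(b"abc", candidates, shared_byte_ratio,
                                   threshold=0.5, upper_limit=2)
```

Here the fourth candidate is never examined, because the upper limit of two matches is reached at the third one.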
  • Step 202: Match the data to be compressed with the m reference data blocks to obtain at least one index code, where each index code includes a reference data block identifier and string information, each reference data block identifier indicates one of the m reference data blocks, and each string information indicates the position, in that reference data block, of a continuous string of the data to be compressed.
  • each index code corresponds to one continuous character string in the data to be compressed, and indicates in which of the m reference data blocks, and at which position in that block, the corresponding continuous character string exists.
  • the string information in the index code may include the starting position of the corresponding continuous character string in the reference data block and a number of characters, indicating that the specified number of characters, counted from the starting position in the reference data block named by the identifier, also exist in the data to be compressed.
  • Step 203: When the data to be compressed contains a continuous character string that does not correspond to any index code, an insertion code containing that continuous character string is generated; the insertion code indicates that the string is to be inserted at the time of decompression.
  • some of the characters in the data to be compressed may not exist in any of the m reference data blocks.
  • for these characters the computing device may generate insertion codes, each insertion code containing a continuous string that does not exist in any of the m reference data blocks.
  • Step 204: Output the at least one index code and the insertion code as the compression result, ordered according to the positions in the data to be compressed of the continuous character strings to which they correspond.
  • FIG. 2C illustrates a flowchart of a multi-reference block-based differential compression method according to the exemplary embodiment shown in FIG. 2A. The method may include the following steps:
  • step 20a, the data to be compressed is divided into a number of continuous strings.
  • each continuous string is either a string that is present in a target reference block, the target reference block being a data block among the m reference data blocks that contains the continuous string, or a string none of whose characters is present in any of the m reference data blocks.
  • the divided continuous strings may have the following two types:
  • the first type is a continuous string existing in one reference data block of m reference data blocks.
  • the division method of such a string can be as follows:
  • the computing device starts from the first of the undivided characters in the data to be compressed and queries whether that character exists in one of the m reference data blocks. If it does, the device queries whether the string consisting of the first two undivided characters exists in one of the m reference data blocks, and so on, until the string consisting of the first p undivided characters exists in one of the m reference data blocks while the string consisting of the first p+1 undivided characters does not exist in any of them. The string consisting of the first p undivided characters is then divided off as a continuous string, where p is an integer greater than or equal to 1.
  • the computing device may also set a character number threshold q for the first type of continuous string. When p reaches q, the device no longer queries whether the string consisting of the first q+1 undivided characters exists in one of the m reference data blocks, and directly divides off the string consisting of the first q undivided characters as a continuous string.
  • the second type is a continuous string that does not exist in any reference data block of m reference data blocks.
  • the division method of such a string can be as follows:
  • the computing device starts from the first of the undivided characters in the data to be compressed and queries whether that character is absent from every one of the m reference data blocks. If it is, the device queries whether the second undivided character is also absent from every reference data block, and so on, until the p'-th undivided character is absent from every one of the m reference data blocks while the (p'+1)-th undivided character exists in one of them. The string consisting of the first p' undivided characters is then divided off as a continuous string, where p' is an integer greater than or equal to 1.
  • the computing device may also set a character number threshold q' for the second type of continuous string. When p' reaches the threshold q', the string consisting of the first q' undivided characters is directly divided off as a continuous string.
  • Step 20b When a continuous character string is a continuous character string existing in one reference data block of the m reference data blocks, an index code of the continuous character string is generated.
  • Step 20c When a continuous character string is a continuous character string that does not exist in any of the reference data blocks of the m reference data blocks, an insertion code including the continuous character string is generated.
  • step 20d the generated index codes and the insertion codes are arranged according to the positions of the corresponding consecutive character strings in the data to be compressed, and the compression code corresponding to the data to be compressed is obtained.
  • Reference data block 1 is: ABCDEFGHIABCDEFGHIMNOPQRST
  • Reference data block 2 is: 12345678910111213141516171
  • Reference data block 3 is: abcdefghijklmnopqrstuvwxyz
  • the data to be compressed is: ABCDEFGHIABCDEFGHI234567891011abcdefghijklXYZ
  • taking the first index code <C1,18,1> as an example, an index code consists of the fixed format "<,,>" and three data items.
  • the first item is the reference data block identifier: "C1" indicates that the reference data block used by the index code is reference data block 1, and "C2" indicates that it is reference data block 2.
  • the second and third items are the string information: the second item "18" indicates that the number of indexed characters is 18, and the third item "1" indicates that the starting position in the reference data block of the continuous string in the data to be compressed is the first character.
  • the meaning of this index code is therefore: index 18 characters starting from the first character of reference data block 1 and, when decompressing, place those 18 characters at the position of the index code.
  • the 19th to the 42nd characters of the data to be compressed are indexed in the same way, giving the second index code <C2,12,2> and the third index code <C3,12,1>.
  • none of the 43rd to 45th characters of the data to be compressed can be found in the 3 reference data blocks, so the insertion code <I,3,XYZ> is generated. The first item "I" indicates that characters are inserted at the current position; the second item "3" indicates that three characters are inserted; the third item "XYZ" gives the characters to insert. That is, the meaning of the insertion code is: add the string "XYZ" of length 3 at the current position.
  • the compression code obtained by compressing the data to be compressed in this example using reference data block 1 to reference data block 3 is: "<C1,18,1><C2,12,2><C3,12,1><I,3,XYZ>".
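  • The worked example can be reproduced with a short sketch of steps 20a to 20d together with the corresponding decompression. The function names (`find_block`, `compress`, `decompress`, `render`) and the tuple representation of the codes are assumptions made for illustration; the optional character number thresholds q and q' are left out for brevity.

```python
def find_block(s: str, refs: list):
    """Return (block number, 1-based start) of the first reference block
    that contains s, or None when no block contains it."""
    for number, block in enumerate(refs, start=1):
        start = block.find(s)
        if start != -1:
            return number, start + 1
    return None


def compress(data: str, refs: list) -> list:
    codes, literal, pos = [], "", 0
    while pos < len(data):
        hit, length = None, 0
        # Grow the candidate string while some reference block still contains it.
        while pos + length < len(data):
            found = find_block(data[pos:pos + length + 1], refs)
            if found is None:
                break
            hit, length = found, length + 1
        if hit is None:            # character absent from every block
            literal += data[pos]
            pos += 1
            continue
        if literal:                # flush the pending insertion code
            codes.append(("I", len(literal), literal))
            literal = ""
        codes.append((f"C{hit[0]}", length, hit[1]))  # index code
        pos += length
    if literal:
        codes.append(("I", len(literal), literal))
    return codes


def decompress(codes: list, refs: list) -> str:
    out = []
    for kind, count, arg in codes:
        if kind == "I":
            out.append(arg)                       # insert the literal string
        else:
            block = refs[int(kind[1:]) - 1]       # "C2" names block 2
            out.append(block[arg - 1:arg - 1 + count])
    return "".join(out)


def render(codes: list) -> str:
    return "".join(f"<{a},{b},{c}>" for a, b, c in codes)


refs = ["ABCDEFGHIABCDEFGHIMNOPQRST",
        "12345678910111213141516171",
        "abcdefghijklmnopqrstuvwxyz"]
data = "ABCDEFGHIABCDEFGHI234567891011abcdefghijklXYZ"
codes = compress(data, refs)
print(render(codes))  # <C1,18,1><C2,12,2><C3,12,1><I,3,XYZ>
```

Decompression only needs the same reference blocks: `decompress(codes, refs)` returns the original data.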
  • the computing device may acquire a plurality of reference data blocks that have some similarity to the data to be compressed and compress the data to be compressed using these reference data blocks, so that high compression efficiency can be ensured even when no reference data block is highly similar to the data to be compressed.
  • the scheme shown in the embodiment of the present invention places a lower requirement on the similarity between the reference data block and the data to be compressed, so a simple similarity matching algorithm is enough to meet the calculation requirements and reference data blocks that meet the requirement are easy to find. This saves computing resources and computing time, and improves the compression efficiency while ensuring the compression effect.
  • the value of m is greater than 1, that is, the computing device needs to find at least two reference data blocks matching the data to be compressed from each reference data block.
  • in practice, it may happen that no reference data block matching the data to be compressed is found, or that only one matching reference data block is found.
  • in that case, the computing device may select a different compression algorithm according to the number of matching reference data blocks found, for example:
  • when no matching reference data block is found, the computing device compresses the data to be compressed using a self-compression algorithm.
  • the specific implementation process of the self-compression algorithm can be as follows:
  • step 1) from the current compression position, examine the unencoded data and try to find the longest matching string in the sliding window. If one is found, proceed to step 2); otherwise proceed to step 3).
  • step 2) output the triple (off, len, c), where, as in the example that follows, off is the offset of the matching string in the sliding window, len is the matching length, and c is the next character.
  • step 3) output the triple (0, 0, c), where c is the next character.
  • after the output, slide the window len+1 characters toward the data to be compressed and continue with step 1).
  • for example, suppose the size of the sliding window is 10 characters, the characters in the window are "abcdbbccaa", and the data to be compressed immediately after the sliding window is "abaeaaabaee".
  • starting from the first character of the data to be compressed, the longest matching string among the 10 characters in the sliding window is "ab", and the character following "ab" is "a", so the triple (0, 2, a) is output. It indicates that two characters are indexed starting from offset 0 in the sliding window and that the character following them is "a". The sliding window then slides 3 characters toward the data to be compressed, and the 10 characters in the sliding window become "dbbccaaaba".
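  • The self-compression steps can be sketched as a small LZ77-style coder. The sketch makes some simplifying assumptions that the text does not state: matches must lie entirely inside the sliding window, the offset is counted from the start of the window, and at least one character is always kept back to serve as c. The function names are hypothetical.

```python
def lz77_compress(history: str, data: str, window_size: int = 10) -> list:
    """Encode `data` as (off, len, c) triples against a sliding window that
    initially holds the last `window_size` characters of `history`."""
    window = history[-window_size:]
    triples, pos = [], 0
    while pos < len(data):
        off, length = 0, 0
        # Try the longest candidate first; keep one character in reserve for c.
        for candidate in range(len(data) - pos - 1, 0, -1):
            found = window.find(data[pos:pos + candidate])
            if found != -1:
                off, length = found, candidate
                break
        c = data[pos + length]
        triples.append((off, length, c))
        # Slide the window len+1 characters toward the data to be compressed.
        window = (window + data[pos:pos + length + 1])[-window_size:]
        pos += length + 1
    return triples


def lz77_decompress(history: str, triples: list, window_size: int = 10) -> str:
    window = history[-window_size:]
    out = []
    for off, length, c in triples:
        piece = window[off:off + length] + c
        out.append(piece)
        window = (window + piece)[-window_size:]
    return "".join(out)


triples = lz77_compress("abcdbbccaa", "abaeaaabaee")
```

For the window and data of the example, the first triple produced is `(0, 2, "a")`, matching the text, and `lz77_decompress("abcdbbccaa", triples)` restores the original data.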
  • the computing device may obtain the compression rate of the previous self-compression, and determine the preset threshold when calculating the similarity between the reference data block and the data to be compressed according to the compression rate of the previous self-compression.
  • in some scenarios, a series of consecutive data to be compressed has similar redundancy. For example, during one period the self-redundancy of several consecutive pieces of data to be compressed is high, while during another period it is low.
  • for such scenarios, the computing device may record the compression ratio each time it selects the self-compression algorithm and compresses data, periodically obtain the previously recorded self-compression ratios, and determine a new preset threshold according to them.
  • for example, the average of the compression ratios of the last 5 self-compressions may be taken, and a new preset threshold determined according to that average.
  • a higher average indicates that the data compressed in the last 5 self-compressions had higher self-redundancy, so the data to be compressed that follows is likely to have high self-redundancy as well. In this case the preset threshold can be raised appropriately so that more data to be compressed is compressed by the self-compression algorithm.
  • conversely, a lower average indicates that the data compressed in the last 5 self-compressions had lower self-redundancy, so the data to be compressed that follows is likely to have low self-redundancy as well. In this case the preset threshold may be lowered appropriately so that more data to be compressed is compressed by the single reference block-based or multi-reference block-based differential compression algorithm. This increases the compression ratio in scenarios where a number of consecutive pieces of data to be compressed have similar redundancy.
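  • One possible way to derive the threshold from the recent self-compression ratios is sketched below. The text only says that a higher average should lead to a higher threshold; the linear mapping, the bounds `low` and `high`, the interpretation of the compression ratio as a value between 0 and 1, and the class name `ThresholdTuner` are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean


class ThresholdTuner:
    """Keep the ratios of the most recent self-compressions and map their
    average onto a similarity threshold between `low` and `high`."""

    def __init__(self, low: float = 0.2, high: float = 0.6, history: int = 5):
        self.low, self.high = low, high
        self.ratios = deque(maxlen=history)  # only the last `history` ratios

    def record(self, ratio: float) -> None:
        """Store the compression ratio of one self-compression run."""
        self.ratios.append(ratio)

    def threshold(self) -> float:
        if not self.ratios:
            return self.low  # no statistics yet: start with the low threshold
        average = min(1.0, max(0.0, mean(self.ratios)))
        # Higher average self-compression ratio gives a higher threshold, so
        # fewer reference blocks match and more data goes to self-compression.
        return self.low + (self.high - self.low) * average
```

A caller would invoke `record()` after every self-compression and re-read `threshold()` periodically before the next round of similarity calculations.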
  • when only one matching reference data block is found, the computing device compresses the data to be compressed using a differential compression algorithm based on a single reference data block.
  • for example, when there is only one reference data block, it is assumed that the reference data block and the data to be compressed are as follows:
  • a compression code consisting of an index code and an insertion code can also be obtained: "<C,18,1><I,24,234567891011abcdefghijkl>".
  • the multi-reference data block-based differential compression method may be converted into a single reference data block-based differential compression method. That is, when m is greater than or equal to 2, the m reference data blocks are connected end to end to obtain a total reference data block, and the data to be compressed is matched with the total reference data block to obtain the compression code of the data to be compressed.
  • the specific compression steps follow the description above. With this simple processing, the plurality of reference data blocks is converted into a single reference data block, so that a differential compression algorithm based on a single reference data block becomes compatible with multiple reference data blocks. The data to be compressed can then be compressed using reference blocks with low similarity to it and still reach a higher compression ratio, without a separate multi-reference block-based compression algorithm, which reduces the complexity of the algorithm.
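  • A brief sketch of the end-to-end concatenation follows. It reuses the three reference data blocks of the earlier example; `compress_single` and the plain "C" identifier are illustrative assumptions, with start positions counted from 1 within the total reference data block.

```python
def compress_single(data: str, ref: str) -> list:
    """Greedy differential compression against one reference block."""
    codes, literal, pos = [], "", 0
    while pos < len(data):
        length, start = 0, -1
        # Grow the candidate string while the reference block contains it.
        while pos + length < len(data):
            found = ref.find(data[pos:pos + length + 1])
            if found == -1:
                break
            length, start = length + 1, found
        if length == 0:            # character absent from the reference block
            literal += data[pos]
            pos += 1
            continue
        if literal:
            codes.append(("I", len(literal), literal))
            literal = ""
        codes.append(("C", length, start + 1))  # 1-based start position
        pos += length
    if literal:
        codes.append(("I", len(literal), literal))
    return codes


refs = ["ABCDEFGHIABCDEFGHIMNOPQRST",
        "12345678910111213141516171",
        "abcdefghijklmnopqrstuvwxyz"]
total_reference = "".join(refs)  # connect the m blocks end to end
data = "ABCDEFGHIABCDEFGHI234567891011abcdefghijklXYZ"
codes = compress_single(data, total_reference)
# [("C", 18, 1), ("C", 12, 28), ("C", 12, 53), ("I", 3, "XYZ")]
```

Because each block here is 26 characters long, the start positions 28 and 53 correspond to position 2 of block 2 and position 1 of block 3, so the result carries the same information as the multi-reference codes without a per-block identifier.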
  • in summary, the computing device acquires at least two reference data blocks that match the data to be compressed, and matches the data to be compressed with the m reference data blocks to obtain at least one index code. Each index code indicates the position, in one of the m reference data blocks, of the continuous character string to which it corresponds, that string being a continuous character string present in the data to be compressed. Compressing the data to be compressed with a plurality of reference data blocks ensures high compression efficiency even when no reference data block is highly similar to the data to be compressed.
  • the data compression method shown in this exemplary embodiment places a lower requirement on the similarity between the reference data block and the data to be compressed, so the similarity matching algorithm is simple and reference data blocks that meet the requirement are easy to find. This improves the compression efficiency while ensuring the compression effect.
  • FIG. 3 shows a block diagram of a data compression apparatus provided by an exemplary embodiment of the present invention.
  • The data compression apparatus may be implemented, by software, hardware, or a combination of both, as all or part of the computing device 100 shown in FIG. 1.
  • The data compression apparatus may be implemented by an application-specific integrated circuit (ASIC) or by a programmable logic device (PLD).
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • When the data compression method shown in FIG. 2A is implemented in software, each unit in the data compression apparatus may also be a software module.
  • the data compression apparatus may include: an obtaining unit 301 and a matching unit 302.
  • the obtaining unit 301 is configured to acquire data to be compressed and m reference data blocks, where m is greater than 1, and m is an integer;
  • The matching unit 302 is configured to match the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code includes a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed.
  • Optionally, the obtaining unit 301 is specifically configured to, when acquiring the m reference data blocks, calculate one by one the similarity between each preset reference data block and the data to be compressed, and acquire, from those reference data blocks, m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold.
  • Optionally, the obtaining unit 301 is specifically configured to, when acquiring the m reference data blocks whose similarity to the data to be compressed is greater than the preset threshold: increase a match count by 1 each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is found, the initial value of the match count being 0; determine whether the match count has reached a preset upper limit M, where M ≥ 2 and M is an integer; if the match count has reached the preset upper limit M, take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed; if the match count has not reached the preset upper limit M, determine whether the similarity between every reference data block and the data to be compressed has been calculated; and if the similarity between every reference data block and the data to be compressed has been calculated, take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed.
  • the device further includes: a generating unit 303;
  • the generating unit 303 is configured to: when there is a continuous character string that does not correspond to the at least one index code in the data to be compressed, generate an insertion code that includes a continuous character string that does not correspond to the at least one index code, The insertion code is used to indicate that a continuous string that does not correspond to the at least one index code is inserted at the time of decompression.
  • the matching unit 302 is specifically configured to connect the m reference data blocks end to end, obtain a total reference data block, and match the to-be-compressed data with the total reference data block.
  • In summary, in the data compression apparatus shown in this exemplary embodiment, the computing device acquires at least two reference data blocks that match the data to be compressed and matches the data to be compressed against the m reference data blocks to obtain at least one index code. Each index code indicates the position, in one of the m reference data blocks, of the consecutive character string corresponding to that index code, and that consecutive character string also exists in the data to be compressed. Because the data is compressed with multiple reference data blocks, high compression efficiency can be ensured even when no reference data block is highly similar to the data to be compressed.
  • In addition, the data compression apparatus shown in this exemplary embodiment places a low requirement on the similarity between the reference data blocks and the data to be compressed. The similarity matching algorithm is simple, and reference data blocks that meet the requirement are easy to find, so compression efficiency can be improved while the compression effect is maintained.
  • a person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or may be instructed by a program to execute related hardware, and the program may be stored in a computer readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

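A data compression method, belonging to the field of data processing. The method includes: acquiring data to be compressed and m reference data blocks, where m is an integer greater than 1 (201); and matching the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code includes a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed (202). Compressing the data with multiple reference data blocks improves compression efficiency.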

Description

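Data Compression Apparatus and Method
Technical Field
The present invention relates to the field of data processing, and in particular to a data compression apparatus and method.
Background
In the field of data processing, data compression is a way of reducing duplicate data through specific steps so as to reduce storage space.
Differential (delta) compression is a commonly used lossless data compression method. It mainly includes the following steps: a specific similarity detection algorithm detects, among several reference data blocks, the one with the highest similarity to the data to be compressed, and the difference between the data to be compressed and that reference data block is calculated to obtain the compression result.
In the course of implementing the present invention, the inventors found that the prior art has at least the following problem:
Existing differential compression algorithms require high similarity between the reference data block and the data to be compressed. When no reference data block is highly similar to the data to be compressed, the compression effect is poor.
Summary
To solve the prior-art problem that the compression effect is poor when no reference data block is highly similar to the data to be compressed, embodiments of the present invention provide a data compression apparatus and method. The technical solutions are as follows: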
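In a first aspect, a data compression method is provided. The method includes: acquiring data to be compressed and m reference data blocks, where m is an integer greater than 1; and matching the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code includes a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed.
The data compression method provided by the embodiments of the present invention compresses the data with multiple reference data blocks, so high compression efficiency can be ensured even when no reference data block is highly similar to the data to be compressed. In addition, the method places a low requirement on the similarity between the reference data blocks and the data to be compressed. The similarity matching algorithm is simple, and reference data blocks that meet the requirement are easy to find, so compression efficiency can be improved while the compression effect is maintained.
In a first possible implementation of the first aspect, when the m reference data blocks are acquired, the similarity between each preset reference data block and the data to be compressed is calculated one by one, and m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold are acquired from those reference data blocks.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, acquiring the m reference data blocks whose similarity to the data to be compressed is greater than the preset threshold includes: increasing a match count by 1 each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is found, the initial value of the match count being 0; determining whether the match count has reached a preset upper limit M, where M ≥ 2 and M is an integer; if the match count has reached the preset upper limit M, taking the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed; if the match count has not reached the preset upper limit M, determining whether the similarity between every reference data block and the data to be compressed has been calculated; and if the similarity between every reference data block and the data to be compressed has been calculated, taking the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed.
The data compression method provided by the embodiments of the present invention only needs to pick, from several reference data blocks, multiple reference data blocks that satisfy a low similarity requirement. Because the similarity requirement is low, the similarity between each reference data block and the data to be compressed is calculated one by one, and once enough matching reference data blocks have been found, the remaining calculations can be skipped. This shortens the matching process and improves compression efficiency.
In a third possible implementation of the first aspect, the method further includes: when the data to be compressed contains a consecutive character string that does not correspond to any of the at least one index code, generating an insertion code that contains that consecutive character string, where the insertion code indicates that the consecutive character string is to be inserted during decompression.
In a fourth possible implementation of the first aspect, matching the data to be compressed against the m reference data blocks includes: connecting the m reference data blocks end to end to obtain one total reference data block; and matching the data to be compressed against the total reference data block.
In a second aspect, an embodiment of the present invention provides a computing device that includes a processor, a memory, and a bus. The memory is connected to the processor through the bus; the processor is configured to execute instructions stored in the memory; and by executing the instructions the processor implements the data compression method provided by the first aspect or by any possible implementation of the first aspect.
In a third aspect, an embodiment of the present invention provides a data compression apparatus that includes at least one unit, the at least one unit being configured to implement the data compression method provided by the first aspect or by any possible implementation of the first aspect.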
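Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a structural block diagram of a computing device according to an exemplary embodiment of the present invention;
FIG. 2A is a flowchart of a data compression method according to an exemplary embodiment of the present invention;
FIG. 2B is a flowchart of a method for acquiring reference data blocks involved in the embodiment shown in FIG. 2A;
FIG. 2C is a flowchart of a multi-reference differential compression method involved in the embodiment shown in FIG. 2A;
FIG. 3 is a block diagram of a data compression apparatus according to an exemplary embodiment of the present invention.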
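Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
FIG. 1 is a structural block diagram of a computing device according to an exemplary embodiment of the present invention. The computing device 100 may include a processor 110, a memory 130, and a bus 150.
The memory 130 is connected to the processor 110 through the bus 150.
The processor 110 includes an arithmetic logic component, a register component, a control component, and the like. It may be a standalone central processing unit, or it may be an embedded processor such as a microprocessor unit (MPU), a microcontroller unit (MCU), or an embedded digital signal processor (EDSP).
The memory 130 is implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The memory 130 may be used to store instructions, which may be implemented as software programs or software modules.
The processor 110 may implement all or some of the steps of the data compression method in the embodiment shown in FIG. 2A below by executing the instructions stored in the memory 130.
Optionally, the computing device 100 may further include components such as a communication component 120 and a cache 140. The communication component 120 and the cache 140 are each connected to the processor 110 through the bus 150.
The communication component 120 is used for external communication, including communication with an external network or with other computing or storage devices. It may include several types of interface, such as an Ethernet interface or a wireless transceiver.
The cache 140 is used to cache intermediate data produced during computation by the processor 110.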
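FIG. 2A is a flowchart of a data compression method according to an exemplary embodiment of the present invention. The method may be used in the computing device shown in FIG. 1. As shown in FIG. 2A, the data compression method may include:
Step 201: acquire data to be compressed and m reference data blocks, where m is an integer greater than 1.
Specifically, when acquiring the m reference data blocks, the computing device may calculate one by one the similarity between each preset reference data block and the data to be compressed, and acquire, from those reference data blocks, m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold.
In this exemplary embodiment, the computing device stores several candidate reference data blocks in advance. When compressing the data to be compressed, it only needs to pick, from these, multiple reference data blocks that satisfy a low similarity requirement. For example, the computing device may use a relatively simple similarity algorithm to calculate the similarity between each reference data block and the data to be compressed, and take the m reference data blocks whose similarity is greater than the preset threshold as the m reference data blocks that match the data to be compressed. Alternatively, the computing device may take the m reference data blocks with the greatest similarity as the m reference data blocks that match the data to be compressed.
Because the similarity requirement is low, many reference data blocks may match the data to be compressed, while a practical application may not need that many. Therefore, in one possible implementation of this exemplary embodiment, the similarity between each reference data block and the data to be compressed may be calculated one by one, and once enough matching reference data blocks have been found, the remaining calculations can be skipped, which shortens the matching process and improves compression efficiency. Specifically, refer to FIG. 2B, which shows a flowchart of a method for acquiring reference data blocks involved in FIG. 2A. As shown in FIG. 2B, the method may include the following steps: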
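Step 201a: calculate the similarity between a reference data block and the data to be compressed.
The device that performs the method shown in FIG. 2A may store N reference data blocks in advance, where N is greater than or equal to M. Steps 201a to 201e select, from these N reference data blocks, up to M reference data blocks with relatively high similarity to the data to be compressed.
In one possible implementation, the similarity between a reference data block and the data to be compressed may be calculated by looking for features that the two have in common. A feature may be represented by a Rabin fingerprint of the reference data block or of the data to be compressed. For example, a fixed number of Rabin fingerprints is selected for the reference data block and for the data to be compressed to form their respective feature subsets, where each fingerprint in a feature subset is one hash value of that feature subset, and the similarity between the reference data block and the data to be compressed is determined by counting the matching Rabin fingerprints in the two feature subsets.
When the similarity between a reference data block and the data to be compressed is calculated, the Rabin fingerprints in the two feature subsets may be compared one by one until all comparisons are finished, or until the proportion of matching Rabin fingerprints in the feature subset has reached the preset threshold. For example, suppose the preset threshold is 20% and the feature subsets of the reference data block and of the data to be compressed each contain 5 Rabin fingerprints. The first Rabin fingerprints of the two feature subsets are compared first. If they match, the comparison ends and the reference data block is determined to match the data to be compressed. Otherwise, the second Rabin fingerprints of the two feature subsets are compared, and so on, until all comparisons are finished or a matching Rabin fingerprint is found.
If the feature subsets of the reference data block and of the data to be compressed contain very many Rabin fingerprints, then to improve matching efficiency the feature subset may be reduced to a super-feature subset, or the Rabin fingerprints in the feature subset may be condensed into super-fingerprints, so as to form a feature subset with fewer fingerprints.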
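The fingerprint comparison described in step 201a can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: `zlib.crc32` over fixed-length windows stands in for a true Rabin fingerprint, set membership replaces the pairwise fingerprint comparison, and the names `feature_subset`, `similarity`, `is_match`, the window length `k`, and the subset size `n` are all hypothetical.

```python
import zlib

def feature_subset(data: bytes, k: int = 4, n: int = 5) -> set:
    """Return up to n fingerprints for data.

    Assumption: CRC32 of each k-byte window replaces a Rabin fingerprint,
    and the n smallest distinct values form the feature subset.
    """
    if len(data) < k:
        return {zlib.crc32(data)}
    prints = {zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)}
    return set(sorted(prints)[:n])

def similarity(block: bytes, data: bytes, k: int = 4, n: int = 5) -> float:
    """Fraction of the data's fingerprints that also occur in the block's subset."""
    block_prints = feature_subset(block, k, n)
    data_prints = feature_subset(data, k, n)
    return len(block_prints & data_prints) / len(data_prints)

def is_match(block: bytes, data: bytes, threshold: float = 0.2,
             k: int = 4, n: int = 5) -> bool:
    """Compare fingerprints one by one and stop as soon as the threshold is reached."""
    block_prints = feature_subset(block, k, n)
    data_prints = sorted(feature_subset(data, k, n))
    hits = 0
    for fingerprint in data_prints:
        if fingerprint in block_prints:
            hits += 1
            if hits / len(data_prints) >= threshold:
                return True  # early exit: enough fingerprints already match
    return False
```

With a 20% threshold and five fingerprints per subset, a single matching fingerprint is enough for `is_match` to return early, which mirrors the example in the text.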
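Step 201b: each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is found, increase the match count by 1.
Step 201c: determine whether the match count has reached the preset upper limit M. If so, go to step 201d; otherwise, go to step 201e.
The initial value of the match count is 0, M ≥ 2, and M is an integer. In this exemplary embodiment, M is a fixed, preset value that may be set by a developer or a user according to the actual compression scenario.
Step 201d: take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed.
In this exemplary embodiment, while the similarity between each reference data block and the data to be compressed is calculated one by one, each time a matching reference data block is found, the device checks whether enough matching reference data blocks have been found. If so, it takes the matching reference data blocks found so far and stops the remaining calculations.
Step 201e: determine whether the similarity between every reference data block and the data to be compressed has been calculated. If so, go to step 201d; otherwise, return to step 201a.
If the similarities have not all been calculated, the device goes on to calculate the similarity between the next reference data block and the data to be compressed.
Each time a matching reference data block is found, if the device determines that not enough matching reference data blocks have been found yet, it may continue the calculations until there are enough or until all reference data blocks have been processed. Once all reference data blocks have been processed, the device takes the matching reference data blocks found so far and stops, regardless of whether their number has reached M.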
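Steps 201a to 201e amount to a linear scan with an early exit. The sketch below shows one way to write that loop; the function name `select_reference_blocks` and its parameters are hypothetical, and the similarity measure is passed in as a callable because the text does not fix one.

```python
from typing import Callable, List, Sequence

def select_reference_blocks(
    data: str,
    candidates: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
    upper_limit: int,
) -> List[str]:
    """Scan candidates in order and stop once upper_limit (M) matches are found."""
    if upper_limit < 2:
        raise ValueError("upper_limit (M) must be an integer >= 2")
    matched: List[str] = []
    match_count = 0                              # initial value of the match count
    for block in candidates:                     # step 201a: one block at a time
        if similarity(block, data) > threshold:
            matched.append(block)
            match_count += 1                     # step 201b
            if match_count >= upper_limit:       # step 201c
                break                            # step 201d: enough blocks, stop early
    # Step 201e falls through here when every candidate has been examined.
    return matched
```

The returned list can hold fewer than M blocks, or fewer than two; the caller then decides which compression algorithm applies.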
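Step 202: match the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code includes a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed.
Each index code corresponds to one consecutive character string in the data to be compressed and indicates in which of the m reference data blocks, and at which position, that string exists. For example, the character string information in an index code may contain the start position of the string in the reference data block and the number of characters, meaning that the given number of characters starting at that position in the identified reference data block also exist in the data to be compressed.
Step 203: when the data to be compressed contains a consecutive character string that does not correspond to any of the at least one index code, generate an insertion code that contains that consecutive character string, where the insertion code indicates that the string is to be inserted during decompression.
In practice, some characters of the data to be compressed may not exist in any of the m reference data blocks. In that case the computing device may generate insertion codes, each of which contains one consecutive character string that does not exist in any of the m reference data blocks.
Step 204: output the at least one index code and the insertion codes as the compression result, ordered by the positions of their corresponding consecutive character strings in the data to be compressed.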
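Specifically, refer to FIG. 2C, which shows a flowchart of a multi-reference differential compression method involved in the exemplary embodiment shown in FIG. 2A. As shown in FIG. 2C, the method may include the following steps:
Step 20a: divide the data to be compressed into several consecutive character strings.
Each consecutive character string either has a target reference block, that is, a block among the m reference data blocks that contains the string, or is a string none of whose characters exists in any of the m reference data blocks.
In this embodiment of the present invention, the consecutive character strings fall into the following two classes:
The first class consists of consecutive character strings that exist in one of the m reference data blocks. Such strings may be obtained as follows:
Starting from the first not-yet-divided character of the data to be compressed, the computing device checks whether that character exists in one of the m reference data blocks. If so, it checks whether the string formed by the first two undivided characters exists in one of the m reference data blocks, and so on, until it finds that the string formed by the first p undivided characters exists in one of the m reference data blocks while the string formed by the first p+1 undivided characters does not exist in any of them. It then takes the string formed by the first p undivided characters as one consecutive character string, where p is an integer greater than or equal to 1.
Alternatively, the computing device may set a character-count threshold q for first-class strings. When p reaches q, it no longer checks whether the string formed by the first q+1 undivided characters exists in one of the m reference data blocks, and directly takes the string formed by the first q undivided characters as one consecutive character string.
The second class consists of consecutive character strings that do not exist in any of the m reference data blocks. Such strings may be obtained as follows:
Starting from the first not-yet-divided character of the data to be compressed, the computing device checks whether that character is absent from all m reference data blocks. If so, it checks whether the second undivided character is absent from all m reference data blocks, and so on, until it finds that the p'-th undivided character is absent from all m reference data blocks while the (p'+1)-th undivided character exists in one of them. It then takes the string formed by the first p' undivided characters as one consecutive character string, where p' is an integer greater than or equal to 1.
Alternatively, the computing device may set a character-count threshold q' for second-class strings. When p' reaches q', it no longer checks whether the (q'+1)-th undivided character is absent from all m reference data blocks, and directly takes the string formed by the first q' undivided characters as one consecutive character string.
Step 20b: when a consecutive character string exists in one of the m reference data blocks, generate the index code of that string.
Step 20c: when a consecutive character string does not exist in any of the m reference data blocks, generate an insertion code that contains that string.
Step 20d: arrange the generated index codes and insertion codes according to the positions of their corresponding consecutive character strings in the data to be compressed, to obtain the compression code of the data to be compressed.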
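For example, suppose m is 3, so there are 3 reference data blocks. The 3 reference data blocks and the data to be compressed are as follows:
Reference data block 1: ABCDEFGHIABCDEFGHIMNOPQRST
Reference data block 2: 12345678910111213141516171
Reference data block 3: abcdefghijklmnopqrstuvwxyz
Data to be compressed: ABCDEFGHIABCDEFGHI234567891011abcdefghijklXYZ
When reference data blocks 1 to 3 are used to compress the data, the first character "A" of the data to be compressed is looked up in the reference data blocks, and the same character "A" is found in reference data block 1. The characters following "A" in reference data block 1 and in the data to be compressed are then compared in sequence, which shows that the 18 characters starting at "A" in reference data block 1 are the same as the 18 characters starting at "A" in the data to be compressed. The 19th character from "A" in reference data block 1 is "M", whereas the 19th character from "A" in the data to be compressed is "2". They differ, so the first index code "<C1,18,1>" is generated. An index code consists of the fixed format "<,,>" and three data items. The first data item is the reference data block identifier: "C1" means that the index code uses reference data block 1, and likewise "C2" means that it uses reference data block 2. The second and third data items are the character string information: the second, "18", is the number of indexed characters, and the third, "1", means that the consecutive character string starts at position 1 in the reference data block. The index code therefore means: index 18 characters starting at the 1st character of reference data block 1, and during decompression place these 18 characters at the position of the index code.
In the same way, the 19th to 42nd characters of the data to be compressed yield the second index code <C2,12,2> and the third index code <C3,12,1>.
None of the 43rd to 45th characters of the data to be compressed can be found in the 3 reference data blocks, so the insertion code <I,3,XYZ> can be generated. The first data item, "I", means that characters are inserted at the current position; the second, "3", is the number of inserted characters; and the third, "XYZ", is the inserted characters themselves. The insertion code therefore means: add the string "XYZ" of length 3 at the current position.
The compression code obtained by compressing the data in this example with reference data blocks 1 to 3 is thus "<C1,18,1><C2,12,2><C3,12,1><I,3,XYZ>".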
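The worked example can be reproduced with a small greedy encoder. The sketch below follows steps 20a to 20d under a few assumptions: the optional length thresholds q and q' are left out, the first block and first position that contain the longest match are chosen, and the names `longest_match`, `compress`, `render`, and `decompress` are hypothetical helpers, not part of the patent.

```python
from typing import List, Tuple, Union

IndexCode = Tuple[int, int, int]    # (block number, length, start position), 1-based
Code = Union[IndexCode, str]        # a plain str is the payload of an insertion code

def longest_match(data: str, start: int, blocks: List[str]) -> IndexCode:
    """Longest prefix of data[start:] found in any block; length 0 if none."""
    best = (0, 0, 0)
    length = 0
    while start + length < len(data):
        candidate = data[start:start + length + 1]
        hit = next(((b, blk.find(candidate)) for b, blk in enumerate(blocks)
                    if candidate in blk), None)
        if hit is None:
            break
        length += 1
        best = (hit[0] + 1, length, hit[1] + 1)
    return best

def compress(data: str, blocks: List[str]) -> List[Code]:
    codes: List[Code] = []
    literal = ""                          # run of characters found in no block
    i = 0
    while i < len(data):
        block_no, length, pos = longest_match(data, i, blocks)
        if length == 0:                   # second-class string: extend the literal run
            literal += data[i]
            i += 1
            continue
        if literal:                       # step 20c: flush pending insertion code
            codes.append(literal)
            literal = ""
        codes.append((block_no, length, pos))   # step 20b: index code
        i += length
    if literal:
        codes.append(literal)
    return codes                          # step 20d: already in data order

def render(codes: List[Code]) -> str:
    """Format codes the way the example writes them."""
    return "".join(f"<I,{len(c)},{c}>" if isinstance(c, str)
                   else f"<C{c[0]},{c[1]},{c[2]}>" for c in codes)

def decompress(codes: List[Code], blocks: List[str]) -> str:
    out = []
    for c in codes:
        if isinstance(c, str):
            out.append(c)
        else:
            block_no, length, pos = c
            out.append(blocks[block_no - 1][pos - 1:pos - 1 + length])
    return "".join(out)
```

Calling `render(compress(data, blocks))` on the three reference blocks and the data above gives `<C1,18,1><C2,12,2><C3,12,1><I,3,XYZ>`, and `decompress` restores the original string. The repeated substring search is quadratic and is kept only for clarity.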
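In the above solution of the embodiments of the present invention, the computing device may acquire multiple reference data blocks that have some similarity to the data to be compressed and compress the data with them, so high compression efficiency can be ensured even when no reference data block is highly similar to the data to be compressed. At the same time, the solution places a low requirement on the similarity between the reference data blocks and the data to be compressed: a simple similarity matching algorithm is enough, and reference data blocks that meet the requirement are easy to find, which saves computing resources and time and improves compression efficiency while maintaining the compression effect.
In the above solution, m is greater than 1, which means the computing device needs to find at least two reference data blocks that match the data to be compressed. In practice, it may find no matching reference data block, or only one. The computing device may therefore choose a different compression algorithm depending on how many matching reference data blocks it finds, for example:
1. When no reference data block matching the data to be compressed is found, the computing device compresses the data with a self-compression algorithm.
For example, in one possible implementation, the self-compression algorithm may proceed as follows:
Starting from the character to be compressed, a string of a certain length adjacent to that character is taken as the sliding window (in the example below, the window is the 10 characters immediately before the data to be compressed), and the following steps are performed. Step 1): starting from the current compression position, examine the uncoded data and try to find the longest matching string in the sliding window; if one is found, go to step 2), otherwise go to step 3). Step 2): output the triple (off, len, c), where off is the offset of the matching string from the window boundary, len is the match length, and c is the next character; then advance the window by len+1 characters and continue with step 1). Step 3): output the triple (0, 0, c), where c is the next character; then advance the window by one character (len+1 with len equal to 0) and continue with step 1).
For example, suppose the sliding window is 10 characters long and contains "abcdbbccaa", and the data to be compressed that immediately follows the window is "abaeaaabaee".
First, starting from the first character of the data to be compressed, the longest string that matches within the 10 characters of the window is "ab", and the character after "ab" is "a". The triple (0,2,a) is output, meaning: index 2 characters starting at offset 0 of the current window, followed by the character "a". The window then slides 3 characters toward the data to be compressed and now contains "dbbccaaaba". The character after the 3 compressed characters "aba" is "e", which has no counterpart in the window, so the triple (0,0,e) is output, meaning: index 0 characters starting at offset 0 of the window, followed by the character "e". The window then slides 1 character and now contains "bbccaaabae"; in the same way, the triple (4,6,e) is output. The data to be compressed, "abaeaaabaee", has thus become the sequence (0,2,a), (0,0,e), (4,6,e), which completes the compression.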
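The sliding-window example can be checked with a short encoder. This is a sketch under assumptions drawn from the example rather than a full LZ77 implementation: offsets are counted from 0 at the start of the window, matches must lie entirely inside the window, ties go to the smallest offset, and one character is always held back as the literal `c`. The names `lz_encode` and `lz_decode` are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[int, int, str]   # (off, len, c)

def lz_encode(window: str, data: str, window_size: int = 10) -> List[Triple]:
    out: List[Triple] = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        max_len = len(data) - i - 1          # keep one character for the literal c
        for off in range(len(window)):
            length = 0
            while (length < max_len and off + length < len(window)
                   and window[off + length] == data[i + length]):
                length += 1
            if length > best_len:            # strict '>' keeps the smallest offset
                best_off, best_len = off, length
        literal = data[i + best_len]
        out.append((best_off, best_len, literal))
        consumed = data[i:i + best_len + 1]
        window = (window + consumed)[-window_size:]   # slide by len + 1 characters
        i += best_len + 1
    return out

def lz_decode(window: str, triples: List[Triple], window_size: int = 10) -> str:
    out = []
    for off, length, literal in triples:
        piece = window[off:off + length] + literal
        out.append(piece)
        window = (window + piece)[-window_size:]
    return "".join(out)
```

For the window "abcdbbccaa" and the data "abaeaaabaee", `lz_encode` returns `[(0, 2, 'a'), (0, 0, 'e'), (4, 6, 'e')]`, matching the triples in the text, and `lz_decode` with the same initial window restores the data.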
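Optionally, the computing device may obtain the compression ratios of past self-compressions and use them to determine the preset threshold applied when calculating the similarity between a reference data block and the data to be compressed.
During data compression, the following scenario may arise: a run of data items to be compressed has similar self-redundancy. For example, during one period several consecutive data items all have high self-redundancy, and during another period several consecutive data items all have low self-redundancy. For this scenario, in one possible implementation of this exemplary embodiment, each time the computing device selects the self-compression algorithm and compresses with it, it may record the compression ratio of that self-compression. During compression, it periodically retrieves the recorded self-compression ratios and determines a new preset threshold from them. For example, it may take the average compression ratio of the 5 most recent self-compressions and determine the new preset threshold from that average. The higher the average, the higher the self-redundancy of the data in the 5 most recent self-compressions; taking that data as a reference, subsequent data is likely to have high self-redundancy as well, so the preset threshold can be raised appropriately so that more data is compressed with the self-compression algorithm. Correspondingly, the lower the average, the lower the self-redundancy of the data in the 5 most recent self-compressions, and subsequent data is likely to have low self-redundancy as well, so the preset threshold can be lowered appropriately so that more data is compressed with the single-reference or multi-reference differential compression algorithm. This improves the compression ratio in scenarios where consecutive data items have similar self-redundancy.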
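The threshold adjustment can be sketched as a small feedback loop. The text only says that the threshold is raised when the recent average is high and lowered when it is low, so everything concrete below is an assumption made for illustration: the class name `ThresholdTuner`, the step size, the `high` and `low` marks, and the convention that a larger compression ratio means more redundancy was removed.

```python
from collections import deque

class ThresholdTuner:
    """Adjust the similarity threshold from the last few self-compression ratios."""

    def __init__(self, threshold: float = 0.5, step: float = 0.25,
                 high: float = 0.6, low: float = 0.3, history: int = 5):
        self.threshold = threshold
        self.step = step              # hypothetical adjustment size
        self.high = high              # hypothetical mark for "high redundancy"
        self.low = low                # hypothetical mark for "low redundancy"
        self.ratios = deque(maxlen=history)   # keeps only the most recent ratios

    def record(self, ratio: float) -> None:
        """Store the compression ratio of one self-compression run."""
        self.ratios.append(ratio)

    def update(self) -> float:
        """Recompute the threshold from the average of the stored ratios."""
        if not self.ratios:
            return self.threshold
        mean = sum(self.ratios) / len(self.ratios)
        if mean > self.high:      # redundant data: favour self-compression
            self.threshold = min(1.0, self.threshold + self.step)
        elif mean < self.low:     # little redundancy: favour delta compression
            self.threshold = max(0.0, self.threshold - self.step)
        return self.threshold
```

A caller would invoke `record` after each self-compression and `update` periodically; how often, and how the average maps to a new threshold in a real system, is left open by the source.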
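2. When only one reference data block matching the data to be compressed is found, the computing device compresses the data with the single-reference differential compression algorithm.
For example, when there is only one reference data block, suppose the reference data block and the data to be compressed are as follows:
Reference data block: ABCDEFGHIABCDEFGHIMNOPQRST
Data to be compressed: ABCDEFGHIABCDEFGHI234567891011abcdefghijkl
Following the method shown in steps 20a to 20d above, single-reference differential compression of this data likewise yields a compression code made up of an index code and an insertion code: "<C,18,1><I,24,234567891011abcdefghijkl>".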
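The choice among the three algorithms depends only on how many matching reference data blocks were found. A minimal sketch of that dispatch follows; the function name and the returned labels are hypothetical.

```python
def choose_algorithm(num_matched: int) -> str:
    """Pick a compression algorithm from the number of matching reference blocks."""
    if num_matched < 0:
        raise ValueError("num_matched must be non-negative")
    if num_matched == 0:
        return "self-compression"
    if num_matched == 1:
        return "single-reference delta"
    return "multi-reference delta"
```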
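Optionally, in yet another possible implementation, the multi-reference differential compression method may be converted into a single-reference differential compression method: when M is greater than or equal to 2, the m reference data blocks are connected end to end to obtain one total reference data block, and the data to be compressed is matched against the total reference data block to obtain its compression code. For the specific compression steps, refer to the description above. With this simple processing, multiple reference data blocks are converted into a single reference data block, so that a differential compression algorithm based on a single reference data block is compatible with multiple reference data blocks. The data can then be compressed with several reference blocks that each have low similarity to it, achieving a higher compression ratio without a separate multi-reference compression algorithm, which reduces the complexity of the algorithm.
In summary, in the data compression method shown in this exemplary embodiment, the computing device acquires at least two reference data blocks that match the data to be compressed and matches the data to be compressed against the m reference data blocks to obtain at least one index code. Each index code indicates the position, in one of the m reference data blocks, of the consecutive character string corresponding to that index code, and that consecutive character string also exists in the data to be compressed. Because the data is compressed with multiple reference data blocks, high compression efficiency can be ensured even when no reference data block is highly similar to the data to be compressed.
In addition, the data compression method shown in this exemplary embodiment places a low requirement on the similarity between the reference data blocks and the data to be compressed. The similarity matching algorithm is simple, and reference data blocks that meet the requirement are easy to find, so compression efficiency can be improved while the compression effect is maintained.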
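Connecting the reference data blocks end to end is straightforward; the only bookkeeping is knowing where each block starts in the total reference data block. The sketch below shows one way to do that. The helper names `concatenate` and `locate` are hypothetical, and mapping a position back to its original block is an illustrative extra that the text does not require.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

def concatenate(blocks: List[str]) -> Tuple[str, List[int]]:
    """Join blocks end to end; also return each block's 0-based start offset."""
    total = "".join(blocks)
    starts = [0] + list(accumulate(len(b) for b in blocks))[:-1]
    return total, starts

def locate(total_pos: int, starts: List[int]) -> Tuple[int, int]:
    """Map a 1-based position in the total block to (block number, 1-based position)."""
    idx = bisect_right(starts, total_pos - 1) - 1
    return idx + 1, total_pos - starts[idx]
```

With the three 26-character blocks from the earlier example, "abcdefghijkl" is found at position 53 of the total block, which `locate` maps back to position 1 of block 3. One difference from the multi-reference method is that a match in the total block may straddle the boundary between two original blocks.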
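Refer to FIG. 3, which shows a block diagram of a data compression apparatus according to an exemplary embodiment of the present invention. The data compression apparatus may be implemented, by software, hardware, or a combination of both, as all or part of the computing device 100 shown in FIG. 1. The data compression apparatus may be implemented by an application-specific integrated circuit (ASIC) or by a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data compression method shown in FIG. 2A is implemented in software, each unit in the data compression apparatus may also be a software module. As shown in FIG. 3, the data compression apparatus may include an obtaining unit 301 and a matching unit 302.
The obtaining unit 301 is configured to acquire data to be compressed and m reference data blocks, where m is an integer greater than 1.
The matching unit 302 is configured to match the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code includes a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed.
Optionally, the obtaining unit 301 is specifically configured to, when acquiring the m reference data blocks, calculate one by one the similarity between each preset reference data block and the data to be compressed, and acquire, from those reference data blocks, m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold.
Optionally, the obtaining unit 301 is specifically configured to, when acquiring the m reference data blocks whose similarity to the data to be compressed is greater than the preset threshold: increase a match count by 1 each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is found, the initial value of the match count being 0; determine whether the match count has reached a preset upper limit M, where M ≥ 2 and M is an integer; if the match count has reached the preset upper limit M, take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed; if the match count has not reached the preset upper limit M, determine whether the similarity between every reference data block and the data to be compressed has been calculated; and if the similarity between every reference data block and the data to be compressed has been calculated, take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed.
Optionally, the apparatus further includes a generating unit 303.
The generating unit 303 is configured to, when the data to be compressed contains a consecutive character string that does not correspond to any of the at least one index code, generate an insertion code that contains that consecutive character string, where the insertion code indicates that the string is to be inserted during decompression.
Optionally, the matching unit 302 is specifically configured to connect the m reference data blocks end to end to obtain one total reference data block, and to match the data to be compressed against the total reference data block.
In summary, in the data compression apparatus shown in this exemplary embodiment, the computing device acquires at least two reference data blocks that match the data to be compressed and matches the data to be compressed against the m reference data blocks to obtain at least one index code. Each index code indicates the position, in one of the m reference data blocks, of the consecutive character string corresponding to that index code, and that consecutive character string also exists in the data to be compressed. Because the data is compressed with multiple reference data blocks, high compression efficiency can be ensured even when no reference data block is highly similar to the data to be compressed.
In addition, the data compression apparatus shown in this exemplary embodiment places a low requirement on the similarity between the reference data blocks and the data to be compressed. The similarity matching algorithm is simple, and reference data blocks that meet the requirement are easy to find, so compression efficiency can be improved while the compression effect is maintained.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.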

Claims (7)

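  1. A data compression method, characterized in that the method comprises:
    acquiring data to be compressed and m reference data blocks, where m is greater than 1 and m is an integer;
    matching the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code comprises a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed.
  2. The method according to claim 1, characterized in that acquiring the m reference data blocks comprises:
    calculating the similarity between each preset reference data block and the data to be compressed;
    acquiring, from the reference data blocks, m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold.
  3. The method according to claim 2, characterized in that acquiring, from the reference data blocks, the m reference data blocks with the greatest similarity to the data to be compressed comprises:
    increasing a match count by 1 each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is found, the initial value of the match count being 0;
    determining whether the match count has reached a preset upper limit M, where M ≥ 2 and M is an integer;
    if the match count has reached the preset upper limit M, taking the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed;
    if the match count has not reached the preset upper limit M, determining whether the similarity between every reference data block and the data to be compressed has been calculated;
    if the similarity between every reference data block and the data to be compressed has been calculated, taking the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed.
  4. A computing device, characterized in that the computing device comprises a processor, a memory, and a bus; the memory is connected to the processor through the bus; the processor is configured to execute instructions stored in the memory;
    the processor implements the data compression method according to any one of claims 1 to 3 by executing the instructions stored in the memory.
  5. A data compression apparatus, characterized in that the apparatus comprises:
    an obtaining unit, configured to acquire data to be compressed and m reference data blocks, where m is greater than 1 and m is an integer;
    a matching unit, configured to match the data to be compressed against the m reference data blocks to obtain at least one index code, where each index code comprises a reference data block identifier and character string information, each reference data block identifier indicates one of the m reference data blocks, and each piece of character string information indicates the position, in a reference data block, of a consecutive character string in the data to be compressed.
  6. The apparatus according to claim 5, characterized in that the obtaining unit is specifically configured to calculate the similarity between each preset reference data block and the data to be compressed, and to acquire, from the reference data blocks, m reference data blocks whose similarity to the data to be compressed is greater than a preset threshold.
  7. The apparatus according to claim 6, characterized in that the obtaining unit is specifically configured to: increase a match count by 1 each time a reference data block whose similarity to the data to be compressed is greater than the preset threshold is found, the initial value of the match count being 0; determine whether the match count has reached a preset upper limit M, where M ≥ 2 and M is an integer; if the match count has reached the preset upper limit M, take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed; if the match count has not reached the preset upper limit M, determine whether the similarity between every reference data block and the data to be compressed has been calculated; and if the similarity between every reference data block and the data to be compressed has been calculated, take the reference data blocks found so far whose similarity to the data to be compressed is greater than the preset threshold as the m reference data blocks that match the data to be compressed.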
PCT/CN2016/101494 2016-01-26 2016-10-08 Data compression apparatus and method WO2017128763A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610052310.7 2016-01-26
CN201610052310.7A CN105743509B (zh) 2016-01-26 2016-01-26 Data compression apparatus and method

Publications (1)

Publication Number Publication Date
WO2017128763A1 (zh) 2017-08-03

Family

ID=56247586

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101494 WO2017128763A1 (zh) 2016-01-26 2016-10-08 Data compression apparatus and method

Country Status (2)

Country Link
CN (1) CN105743509B (zh)
WO (1) WO2017128763A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
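CN105743509B (zh) * 2016-01-26 2019-05-24 华为技术有限公司 Data compression apparatus and method
CN107783990B (zh) * 2016-08-26 2021-11-19 华为技术有限公司 Data compression method and terminal
CN106850141A (zh) * 2017-01-20 2017-06-13 济南浪潮高新科技投资发展有限公司 Lossless compressed transmission method for physical information system data using the deflate algorithm
CN109255090B (zh) * 2018-08-14 2021-08-03 华中科技大学 Index data compression method for web graphs
CN110958212B (zh) * 2018-09-27 2022-04-12 阿里巴巴集团控股有限公司 Data compression and data decompression method, apparatus, and device
CN109474279B (zh) * 2018-11-05 2022-09-23 安庆师范大学 Data compression method and apparatus
CN112544038B (zh) * 2019-07-22 2024-07-05 华为技术有限公司 Method, apparatus, device, and readable storage medium for storage system data compression
CN111061428B (zh) * 2019-10-31 2021-05-18 华为技术有限公司 Data compression method and apparatus
CN117171399B (zh) * 2023-11-02 2024-02-20 云图数据科技(郑州)有限公司 Cloud-platform-based optimized storage method for new energy data
CN117195005B (zh) * 2023-11-03 2024-01-26 山东四季车网络科技有限公司 Information data management system based on smart car washing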

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
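CN101383617A (zh) * 2007-09-07 2009-03-11 三星电子株式会社 Data compression/decompression apparatus and method
CN102999543A (zh) * 2006-04-11 2013-03-27 Emc公司 Efficient data storage using the similarity of data segments
CN104657362A (zh) * 2013-11-18 2015-05-27 深圳市腾讯计算机系统有限公司 Data storage and query method and apparatus
CN104753540A (zh) * 2015-03-05 2015-07-01 华为技术有限公司 Data compression method, data decompression method, and apparatus
CN105743509A (zh) * 2016-01-26 2016-07-06 华为技术有限公司 Data compression apparatus and method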

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
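CN1928850B (zh) * 2006-08-11 2011-04-13 白杰 Data-dictionary-based data compression method and apparatus
CN102724500B (zh) * 2012-06-05 2015-10-14 沙基昌 Video data compression/decompression method and system
CN103326730B (zh) * 2013-06-06 2016-05-18 清华大学 Data parallel compression method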


Also Published As

Publication number Publication date
CN105743509A (zh) 2016-07-06
CN105743509B (zh) 2019-05-24


Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16887637; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 16887637; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1