WO2017071431A1 - Encoding method and apparatus - Google Patents

Encoding method and apparatus

Info

Publication number
WO2017071431A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
block
target data
matching
Prior art date
Application number
PCT/CN2016/099593
Other languages
English (en)
French (fr)
Inventor
关坤
冷继南
王工艺
全绍晖
沈建强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP16858867.1A (EP3361393A4)
Publication of WO2017071431A1
Priority to US15/924,007 (US10305512B2)

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6011 Encoder aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 Data deduplication

Definitions

  • the present invention relates to the field of data compression technologies, and in particular, to an encoding method and apparatus.
  • the Delta algorithm is a lossless data compression technique that calculates the Delta encoding between a new file and a reference file already stored in the system. For example, when a new file needs to be stored, it is matched against multiple reference files already stored in the system; if the similarity between the new file and one of the reference files exceeds a preset threshold, a Delta code corresponding to the new file is calculated, and only the Delta code needs to be stored in the system, not the new file itself. When the new file is restored, it can be rebuilt from the similar reference file and the Delta code corresponding to the new file. In this way, when similar files are stored, compressing them with Delta encoding can greatly save storage space.
  • XDelta coding is a commonly used Delta coding algorithm.
  • the core idea of XDelta coding is to search the reference block for sub-blocks that match the target block. For example, if 3 or 4 consecutive bytes are the same, the match is considered successful.
  • the present application provides an encoding method and apparatus for solving the technical problem of low data compression efficiency when using Delta encoding.
  • an encoding method comprising:
  • the first coding sequence includes a matching length and an offset; the matching length indicates the length of the target data that is successfully matched, and the offset indicates the location of the data that matches the target data successfully matched this time.
  • after a sub-target block in the target block is obtained (for example, referred to as a first sub-target block), it is hashed first, and the corresponding hash value is then queried in the first hash table according to the operation result. The corresponding position in the reference block is found according to the queried hash value, that is, the first reference data is found, so that the first sub-target block can be matched backward from that position (that is, the first target data of the first sub-target block is matched with the first reference data, and the second target data in the target block is matched with the second reference data in the reference block).
  • the method before the querying the first hash table according to the first key value, the method further includes:
  • each reference data block includes n-bit reference data
  • the first sub-target block includes n-bit target data
  • the n is a positive integer
  • the key value of the first hash table is obtained by the hash operation of the reference data block.
  • the first hash table can be pre-built according to the reference data in the known reference block for later use.
  • the amount of reference data participating in each hash operation when the first hash table is constructed needs to equal the amount of target data involved in each operation when the target block is matched (that is, the amount of data included in one sub-target block); otherwise the operation results may differ in length and cannot be matched.
  • before the generating the first coding sequence, the method further includes: matching the target data located before the first sub-target block in the target block with other reference data located before the first reference data in the reference block.
  • the first sub-target block and the target data located before it in the target block may be forward matched against the first reference data and the other reference data located before it in the reference block until no further match is possible. In this way, as much target data as possible can be matched at one time, which effectively reduces the number of matches and the number of subsequently generated coding sequences, thereby reducing the burden on the system.
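  • A minimal sketch of this forward-matching step, assuming the blocks are held as Python byte strings. The function name `extend_before` and the parameters `t_pos` (position of the first target data of the sub-target block) and `r_pos` (position of the first reference data) are hypothetical and do not come from the application. The function counts how many items immediately before those two positions are also equal.

```python
def extend_before(target: bytes, t_pos: int, reference: bytes, r_pos: int) -> int:
    """Count the items immediately before t_pos / r_pos that are equal."""
    count = 0
    # Walk towards the start of both blocks until they differ or one runs out.
    while (t_pos - count > 0 and r_pos - count > 0
           and target[t_pos - count - 1] == reference[r_pos - count - 1]):
        count += 1
    return count
```

  • For example, with target `b"XAEBCFG"` and reference `b"ZZAEBCFG"`, a hit at `t_pos=3`, `r_pos=4` can be extended by the two items `AE` that precede it in both blocks.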
  • the first coding sequence further includes target data that is not successfully matched, the target data that is not successfully matched being the target data between the last target data corresponding to the previous coding sequence and the first target data that is successfully matched.
  • the coding sequence provides a field for storing the target data that is not successfully matched, so that the decoding result obtained in the subsequent decoding is as close as possible to the original target data.
  • the method further includes:
  • the hash value corresponding to the key value indicates the address of the target data in the target block, and the first target data is obtained according to the second hash value,
  • the first target data of the first sub-target block is matched with the first target data
  • the second target data in the target block is matched with the third target data in the target block
  • the second target data is other target data located after the first target data of the first sub-target block
  • the third target data is other target data in the target block after the first target data
  • the target data in the target block before the first sub-target block is matched with other target data in the target block before the first target data, to obtain a first matching result;
  • the generating the first coding sequence includes:
  • the first coding sequence further includes an indication bit: when the amount of target data matched in the first matching result is greater than that in the second matching result, the indication bit indicates that the data matching the target data successfully matched this time is located in the target block; or, when the amount of target data matched in the second matching result is greater than that in the first matching result, the indication bit indicates that the data matching the target data successfully matched this time is located in the reference block.
  • the first hash table and the second hash table are provided in the present application, and for a key value, query matching can be performed in the two hash tables.
  • the same hash value may appear in both the first hash table and the second hash table, so when the tables are searched and matched according to the operation result, two matching results may be obtained, and the matching result with the larger amount of matched data can be selected for encoding, which improves the matching accuracy and increases the compression ratio.
  • a hash value corresponding to the first key value is not found in the first hash table, and the generating the first code sequence further includes:
  • a hash value corresponding to the key value in the second hash table indicates an address of the target data;
  • the first target data is obtained according to the hash value corresponding to the first key value in the second hash table, the first target data of the first sub-target block is matched with the first target data, and the second target data in the target block is matched with the third target data in the target block, where the second target data is other target data located after the first target data of the first sub-target block, and the third target data is other target data in the target block that is located after the first target data;
  • the first coding sequence further includes an indication bit for indicating that data matching the target data that is successfully matched this time is located in the target block.
  • the first hash table and the second hash table are provided in the present application, and for a key value, query matching can be performed in the two hash tables. If a hash value appears only in one of the hash tables, then only one matching result may be obtained when performing table lookup and matching according to the operation result, then the matching result may be directly encoded.
  • the coding sequence in the present application provides an indication bit, which can indicate whether the data matching the target data that is successfully matched this time is located in the target block or in the reference block, so that subsequent decoding can accurately find where the data matching the successfully matched target data of the coding sequence is located, thereby improving the accuracy of the decoding.
  • the position of the data can be indicated by the indication bit directly in the coding sequence, and no additional storage space is needed for recording the information, which can effectively save storage space, and can be directly used for decoding, thereby improving decoding efficiency.
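  • The selection between the two hash tables can be sketched as follows. This is only an illustration under stated assumptions: the raw n-item slice stands in for the hashed key value, both tables are plain Python dicts mapping a key to an address, the names `match_len`, `choose_match`, `IN_REFERENCE` and `IN_TARGET` are hypothetical, and a tie is resolved in favour of the reference block because the application does not say how ties are handled.

```python
IN_REFERENCE, IN_TARGET = 0, 1  # hypothetical values of the indication bit


def match_len(data: bytes, pos: int, source: bytes, src_pos: int) -> int:
    """Count how many items of data[pos:] equal source[src_pos:], item by item."""
    n = 0
    while (pos + n < len(data) and src_pos + n < len(source)
           and data[pos + n] == source[src_pos + n]):
        n += 1
    return n


def choose_match(target, t_pos, reference, key, first_table, second_table):
    """Return (indication_bit, address, length) of the longer match, or None."""
    candidates = []
    if key in first_table:        # candidate located in the reference block
        addr = first_table[key]
        candidates.append((match_len(target, t_pos, reference, addr), IN_REFERENCE, addr))
    if key in second_table:       # candidate located earlier in the target block
        addr = second_table[key]
        candidates.append((match_len(target, t_pos, target, addr), IN_TARGET, addr))
    if not candidates:
        return None
    # max() keeps the first candidate on a tie, i.e. the reference-block match.
    length, bit, addr = max(candidates, key=lambda c: c[0])
    return (bit, addr, length) if length > 0 else None
```

  • When the key value is present in only one table, the single candidate is returned directly, which corresponds to the case where only one matching result is obtained.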
  • the method further includes:
  • the hash value corresponding to the first key value stored in the second hash table may be updated according to the current hash value corresponding to the first key value, where updating may mean replacing the old hash value in the second hash table with the new hash value, so that the hash values in the second hash table are kept up to date and the success rate of the next match is improved.
  • the first hash table is queried according to the first key value, and if a hash value corresponding to the first key value is not found in the first hash table, the second hash table is queried according to the first key value;
  • the hash value corresponding to the value indicates the address of the first target data of the first sub-target block.
  • the second hash table is updated so that the hash value corresponding to the first key value indicates the address of the first target data of the first sub-target block.
  • the hash value corresponding to the first key value may be inserted into the corresponding position in the second hash table, that is, the second hash table is updated according to the hash value corresponding to the first key value, so that when the first key value occurs again, the corresponding hash value can be found in the second hash table.
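  • A minimal sketch of the query order and of the update of the second hash table described above, assuming both tables are Python dicts keyed by the key value. The function name `query_then_update` and the string labels in the result are hypothetical.

```python
def query_then_update(key, t_pos, first_table, second_table):
    """Query the first table, fall back to the second, then record t_pos.

    Returns ("reference", address), ("target", address) or None.
    """
    if key in first_table:
        hit = ("reference", first_table[key])
    elif key in second_table:
        hit = ("target", second_table[key])
    else:
        hit = None
    # Insert the entry, or replace the old one, so that the key value now
    # points at the first target data of the current sub-target block.
    second_table[key] = t_pos
    return hit
```

  • The first time a key value is seen it is only inserted; the next time the same key value occurs, the earlier position in the target block is returned and then replaced by the newer one.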
  • an encoding apparatus including:
  • a memory for storing instructions
  • the address of the reference data; if the first hash value corresponding to the first key value is found in the first hash table, acquiring, according to the first hash value, the first reference data corresponding to the address indicated by the first hash value, and matching the first target data of the first sub-target block with the first reference data,
  • the second target data in the target block is matched with the second reference data in the reference block, and the second target data is other target data located after the first target data of the first sub-target block,
  • the second reference data is other reference data in the reference block that is located after the first reference data;
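  • the first coding sequence includes a matching length and an offset; the matching length indicates the length of the target data that is successfully matched, and the offset indicates the location of the data that matches the target data successfully matched this time.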
  • the processor is further configured to:
  • the reference data block is obtained from the reference block according to the first step length; each reference data block includes n-bit reference data, the first sub-target block includes n-bit target data, and n is a positive integer;
  • the key value of the first hash table is obtained by the hash operation of the reference data block.
  • the processor is further configured to:
  • the first coding sequence further includes target data that is not successfully matched, the target data that is not successfully matched being the target data between the last target data corresponding to the previous coding sequence and the first target data that is successfully matched.
  • the processor is further configured to:
  • the first target data is obtained; the first target data of the first sub-target block is matched with the first target data, and second target data in the target block is matched with third target data in the target block, where the second target data is other target data located after the first target data of the first sub-target block, and the third target data is other target data located in the target block after the first target data; and target data in the target block before the first sub-target block is matched with other target data in the target block before the first target data, to obtain a first matching result
  • the first target data of the first sub-target block is matched with the first reference data, and the second target data in the target block is matched with the second reference data in the reference block
  • the first coding sequence further includes an indication bit: when the amount of target data matched in the first matching result is greater than that in the second matching result, the indication bit indicates that the data matching the target data successfully matched this time is located in the target block; or, when the amount of target data matched in the second matching result is greater than that in the first matching result, the indication bit indicates that the data matching the target data successfully matched this time is located in the reference block.
  • the processor is further configured to:
  • a hash value corresponding to the key value in the second hash table indicates an address of the target data;
  • the first target data is obtained according to the hash value corresponding to the first key value in the second hash table, the first target data of the first sub-target block is matched with the first target data, and the second target data in the target block is matched with the third target data in the target block, where the second target data is other target data located after the first target data of the first sub-target block, and the third target data is other target data in the target block that is located after the first target data;
  • the first coding sequence further includes an indication bit for indicating that data matching the target data that is successfully matched this time is located in the target block.
  • the processor is further configured to:
  • the processor is further configured to:
  • the first hash table is queried according to the first key value, and if a hash value corresponding to the first key value is not found in the first hash table, the second hash table is queried according to the first key value;
  • the hash value corresponding to the value indicates the address of the first target data of the first sub-target block.
  • the processor is further configured to:
  • a hash value corresponding to the value indicates the address of the first target data of the first sub-target block.
  • an encoding apparatus comprising means for performing the method of the first aspect.
  • the solution proposed by the present application can complete matching of sub-target blocks faster, saves data compression time, and improves data compression efficiency.
  • FIG. 2 is a schematic diagram of a process of constructing a first hash table according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a coding sequence in an embodiment of the present invention.
  • FIG. 5 is a structural block diagram of an encoding apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an encoding apparatus according to an embodiment of the present invention.
  • the present invention fully considers the above problem.
  • after a sub-target block is obtained, it is hashed first, and the corresponding hash value is then queried, according to the operation result, in the first hash table corresponding to the reference block. The corresponding position in the reference block is found according to the queried hash value, that is, the corresponding reference data is found, so that the sub-target block can be matched at that position; for example, the sub-target block can be matched backward from the reference data that is found. In this way, by determining an approximate location in advance, the range that needs to be matched is reduced, the workload of the system is greatly reduced, data compression time is saved, the efficiency of data compression is improved, and system performance is also improved.
  • each target block has a size of 8 Kbit, that is, the total size of the target blocks is 400 Kbit, and there is one reference block whose size is 8 Kbit.
  • the following describes a method for encoding each target block in the embodiment of the present invention. After the 50 target blocks are encoded according to the method in the embodiment of the present invention, if the compression ratio is, for example, 50%, the compression result consists of a 200 Kbit code sequence plus the 8 Kbit reference block, which is much smaller than 400 Kbit and saves storage space.
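  • The arithmetic behind this example, written out. The 50% compression ratio is the figure assumed in the example, not a measured result.

```latex
\begin{aligned}
\text{original size} &= 50 \times 8\ \text{Kbit} = 400\ \text{Kbit},\\
\text{stored size}   &= 0.5 \times 400\ \text{Kbit} + 8\ \text{Kbit} = 208\ \text{Kbit}.
\end{aligned}
```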
  • the target block refers to a data block having a size of 8 Kbit
  • the reference block refers to a reference block having a size of 8 Kbit.
  • the target block and the reference block are generally the same size; in practice the sizes of the reference block and the target block can be defined as needed, and the sizes here are only an example. The target data is one item of data in the target block, the reference data is one item of data in the reference block, and a sub-target block is composed of consecutive n-bit target data in the target block, where n is a positive integer.
  • an embodiment of the present invention provides an encoding method, and a flow of the method is described as follows.
  • Step 101: Acquire a first sub-target block, where the first sub-target block belongs to a target block;
  • Step 102: Perform a hash operation on the first sub-target block to obtain a first key value, and query the first hash table according to the first key value, where the hash value corresponding to a key value of the first hash table indicates the address of reference data in the reference block. If the first hash value corresponding to the first key value is found in the first hash table, the first reference data corresponding to the address indicated by the first hash value is obtained according to the first hash value, the first target data of the first sub-target block is matched with the first reference data, and the second target data in the target block is matched with the second reference data in the reference block, where the second target data is other target data located after the first target data of the first sub-target block, and the second reference data is other reference data in the reference block located after the first reference data;
  • Step 103: Generate a first coding sequence according to a matching result of the first target data and the second target data, where the first coding sequence includes a matching length and an offset; the matching length indicates the length of the target data that is successfully matched, and the offset indicates the location of the data that matches the target data successfully matched this time.
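  • Steps 101 to 103 can be sketched roughly as follows. This is an illustration under stated assumptions, not the application's implementation: data items are bytes, the raw n-item slice stands in for the hashed key value, the first hash table is a Python dict, and the names `build_table` and `encode_step` are hypothetical. Only the matching length and offset of the coding sequence are produced.

```python
def build_table(reference: bytes, n: int = 4) -> dict:
    """Map every n-item reference data block to its offset in the reference block."""
    return {reference[p:p + n]: p for p in range(len(reference) - n + 1)}


def encode_step(target: bytes, t_pos: int, reference: bytes, table: dict, n: int = 4):
    """Return (matching_length, offset) for the sub-target block at t_pos, or None."""
    sub_block = target[t_pos:t_pos + n]        # step 101: acquire the sub-target block
    offset = table.get(sub_block)              # step 102: query the first hash table
    if offset is None:
        return None
    length = 0                                 # step 102: match backward from the hit
    while (t_pos + length < len(target) and offset + length < len(reference)
           and target[t_pos + length] == reference[offset + length]):
        length += 1
    return (length, offset)                    # step 103: fields of the coding sequence
```

  • With the reference block `AEBCFGHIJKLMN` and the target block `BCFGHIJKPQA`, the sub-target block `BCFG` is found at offset 2 and the match extends to 8 items.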
  • the first sub-target block may include consecutive n-bit data in any one of the target blocks to be compressed, and n is a positive integer.
  • the first sub-target block includes a first quantity of consecutive data, that is, the first sub-target block may be regarded as a data combination, or a data segment, consisting of the first quantity of consecutive data. The first quantity is n.
  • the first sub-target block here is a part of the continuous data in one target block, that is, a sub-target block in one target block; the target block itself may include more data (the amount of data it includes is greater than n).
  • the first sub-target block includes 4 consecutive data in one of the target blocks.
  • "first sub-target block" in the embodiment of the present invention is only a name and does not limit the actual situation.
  • the first sub-target block may be a complete target block, or the first sub-target block may be a sub-target block in a complete target block.
  • the data included in the first sub-target block may be hashed according to a hash operation algorithm.
  • the hash algorithm here may be a predetermined hash algorithm, and the hash algorithm is the same algorithm as the hash algorithm corresponding to the first hash table.
  • the hash table is, for example, in the form of a Key-Value.
  • the data in the first sub-target block is hashed to obtain a Key.
  • Key and Value are in one-to-one correspondence; according to the obtained Key, the corresponding Value can be queried in the first hash table.
  • the Value may be used to indicate the distance between the address of the first reference data corresponding to the Value and the first address of the reference block.
  • a hash value can be considered to correspond to a data combination, such as a sub-target block or a reference data block, and a reference data block includes n-bit reference data.
  • a hash value may also be considered to correspond to a single item of data, such as the first data in a sub-target block or the first data in a reference data block.
  • the process of searching the first hash table may be: performing a hash operation on the data in the first sub-target block to obtain the first key, and searching the first hash table for a key identical to the first key; if the first key is found in the first hash table, the first key in the first hash table corresponds to a first Value, and the first Value may be regarded as the first hash value corresponding to the first key of the first sub-target block.
  • the first sub-target block in the target block includes BCFG, and the BCFG is four consecutive data.
  • here each data in the first sub-target block is one character; in practice, each data in the first sub-target block may be a character, a number, or another possible data type. The first sub-target block is hashed to obtain a key value, for example a first key value; for the first key value, the corresponding hash value (for example, called the first hash value) can be queried in the first hash table.
  • the first hash table is searched for a key identical to the first key. As long as such a key is found (for example, a second key in the first hash table), it corresponds to a Value (the first hash value), and the first reference data indicated by the first hash value can be determined. The first reference data is located in the reference block, so the first target data of the first sub-target block can be matched with the first reference data, and the second target data in the target block can be matched with the second reference data in the reference block. That is, after the first reference data is determined, the first sub-target block and the target data located after it in the target block are matched backward against the first reference data and the other reference data located after it in the reference block until no further match is possible, and a matching result is obtained.
  • the data included in the first sub-target block is BCFG, and the BCFG is four consecutive data.
  • the reference block is AEBCFGHIJKLMN.
  • the first sub-target block can be matched, starting from data B, against data B in the reference block and the data after data B.
  • the BCFG in the first sub-target block is matched successfully. Because the solution provided by the embodiment of the present invention matches backward, after the first sub-target block is successfully matched, the other target data located after the first sub-target block in the target block can continue to be matched.
  • the target block is BCFGHIJKPQA
  • the first sub-target block is BCFG
  • other target data starting from the data H can continue to be matched until no further match is possible.
  • the matching result obtained by the final matching is BCFGHIJK, which is the first matching result in the present application file.
  • the data included in the first sub-target block is BCFG
  • the BCFG is four consecutive data
  • the reference block is AEBCFQHIJKLMN.
  • the first sub-target block can be matched, starting from data B, against data B in the reference block and the data after data B.
  • the BCF of the BCFG in the first sub-target block can be successfully matched, but the match fails at data G, so the other target data located after the first sub-target block in the target block naturally cannot be matched either. Therefore, in this example, the matching result obtained by the final matching is BCF, which is the first matching result in the present application file.
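  • The two examples above can be reproduced with a small sketch of backward matching. It assumes that the position of data B in the reference block (index 2) has already been obtained from the first hash table; the function name `match_backward` is hypothetical.

```python
def match_backward(target: bytes, t_pos: int, reference: bytes, r_pos: int) -> bytes:
    """Return the run of target data starting at t_pos that equals the
    reference data starting at r_pos."""
    length = 0
    while (t_pos + length < len(target) and r_pos + length < len(reference)
           and target[t_pos + length] == reference[r_pos + length]):
        length += 1
    return target[t_pos:t_pos + length]


# First example: the match continues past the sub-target block BCFG.
print(match_backward(b"BCFGHIJKPQA", 0, b"AEBCFGHIJKLMN", 2))  # b'BCFGHIJK'
# Second example: the reference block has Q where the target has G.
print(match_backward(b"BCFG", 0, b"AEBCFQHIJKLMN", 2))         # b'BCF'
```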
  • the first reference data herein may be the first data in the reference data block that participates in the operation to obtain the first hash value in the first hash table, that is, the data corresponding to the first address in the reference data block, or
  • the first reference data herein may also refer to the reference data block that participates in the operation to obtain the first hash value in the first hash table.
  • the amount of data included in one sub-target block participating in the hash operation is not limited in the present invention and can be set according to specific needs. For example, the larger the sub-target block, the more data participates in each hash operation, the smaller the workload of the system, and the shorter the time required for the matching process; the smaller the sub-target block, the less data participates in each hash operation, the finer the matching process, and the more accurate the matching result.
  • the offset included in the first coding sequence may be used to indicate the location of the reference data in the reference block that matches the target data of the current matching success in the target block.
  • the method before querying the first hash table according to the first key value, the method further includes:
  • the reference data block is obtained from the reference block; each reference data block includes n-bit reference data, the first sub-target block includes n-bit target data, and n is a positive integer;
  • the first hash table is constructed, and the key value of the first hash table is obtained by a hash operation of the reference data block.
  • the first hash table may be pre-built according to the reference data in the known reference block, wherein each reference data block participating in the hash operation includes n-bit reference data.
  • the amount of reference data participating in each hash operation and the amount of target data participating in each operation when the target block is matched (that is, the amount of data included in one sub-target block) must be equal; otherwise the operation results may differ in length and cannot be matched.
  • the hash algorithm is, for example, an operation based on a golden-ratio prime; that is, in this embodiment a Key is the result of applying the hash operation to a reference data block, and Value represents the distance between the first address of that reference data block and the first address of the reference block, which is equivalent to indicating the location of the reference data block in the reference block.
  • the value of n can be set as needed, and the present invention is not limited, and for example, it can be set to 4.
  • Value = p - head
  • Here p is the position in the reference block of the reference data block from which the Key corresponding to the Value is calculated (for example, the first address of that reference data block), and head is the position of the first data of the reference block (that is, the first address of the reference block). The difference between the two addresses is the distance between the first data of that reference data block and the first data of the reference block, which is the value of Value.
  • Choosing a hash table of the right size is very important.
  • Value: the typical block size for Delta block compression is 4 KB or 8 KB, so 2 Bytes can be used to represent Value (the maximum block size this allows is 64 KB). This is sufficient, and it is byte-aligned, so no bits are wasted.
  • Off (offset): the distance between the current data and the reference data can be represented by 15 bits, so the 2-Byte Value design here is also sufficient.
  • The L1 cache (first-level cache) of the CPU (Central Processing Unit) of a typical server has a capacity of at least 32 KB. The coding scheme in the embodiment of the present invention includes two hash tables, so each hash table can be designed to be 16 KB, which ensures good speed performance. With 2-Byte Values, the number of Keys in the first hash table can therefore be designed to be 8K, and each Key is less than 8K.
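  • The sizing argument above can be checked with a few lines of arithmetic. The sketch below assumes that the cache and table sizes are byte quantities (a 32 KB L1 cache shared by two tables, 2-Byte Values); the constant names are illustrative and do not come from the embodiment.

```python
# Sizing sketch; all names are hypothetical and sizes are assumed to be in bytes.
L1_CACHE_BYTES = 32 * 1024   # assumed minimum L1 cache of a typical server CPU
NUM_HASH_TABLES = 2          # first (reference) and second (target) hash table
VALUE_BYTES = 2              # each slot stores a 2-Byte Value

table_bytes = L1_CACHE_BYTES // NUM_HASH_TABLES   # cache budget per table
num_slots = table_bytes // VALUE_BYTES            # number of distinct Keys
key_bits = num_slots.bit_length() - 1             # bits the hash must produce
max_block_bytes = 1 << (8 * VALUE_BYTES)          # largest block a Value can address

print(table_bytes, num_slots, key_bits, max_block_bytes)  # 16384 8192 13 65536
```

  • Under these assumptions a 13-bit Key indexes each table, and both tables together fit in the L1 cache.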
  • When the first hash table is constructed, a sampling technique may be adopted to speed up construction further; for example, at each sampling, p is shifted by a first step length of data.
  • For example, the reference block to be built into the first hash table includes the data WILMBCFGAB, and four data participate in each hash operation, that is, n is 4. To fill the first hash table without sampling, the reference data blocks WILM, ILMB, LMBC, MBCF, BCFG, CFGA, and FGAB included in the reference block may each be hashed. If reference data blocks are selected by sampling, for example, with 2 data skipped between consecutive reference data blocks (that is, step is 2), then the reference data blocks participating in the hash operation are WILM, MBCF, and FGAB. Sampling thus reduces the number of reference data blocks involved in constructing the first hash table, thereby reducing the workload of the system.
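  • The construction of the first hash table with sampling can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: `block_hash`, `sampled_positions`, and `build_first_hash_table` are hypothetical names, the multiplier is a constant commonly used for golden-ratio multiplicative hashing, and `skip` is read as the number of data skipped between sampled blocks, which is the reading that reproduces WILM, MBCF, and FGAB.

```python
GOLDEN_PRIME = 2654435761  # multiplier commonly used for golden-ratio hashing (assumption)

def block_hash(block: str, key_bits: int = 13) -> int:
    """Hypothetical hash: fold the block into 32 bits, multiply, keep the top bits."""
    value = int.from_bytes(block.encode("ascii"), "little") & 0xFFFFFFFF
    return ((value * GOLDEN_PRIME) & 0xFFFFFFFF) >> (32 - key_bits)

def sampled_positions(length: int, n: int, skip: int) -> range:
    """Start positions of the reference data blocks; skip=0 means no sampling."""
    return range(0, length - n + 1, skip + 1)

def build_first_hash_table(reference: str, n: int = 4, skip: int = 0) -> dict:
    """Map Key -> Value, where Value = p - head and head is taken as position 0."""
    table = {}
    for p in sampled_positions(len(reference), n, skip):
        table[block_hash(reference[p:p + n])] = p  # a later block overwrites an equal Key
    return table

reference = "WILMBCFGAB"
blocks = [reference[p:p + 4] for p in sampled_positions(len(reference), 4, 2)]
print(blocks)  # ['WILM', 'MBCF', 'FGAB']
table = build_first_hash_table(reference, n=4, skip=2)
```

  • Overwriting on an equal Key is only one of the two update policies the embodiment allows; keeping the first Value would be equally valid.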
  • the first row in FIG. 2 represents the data included in the reference block
  • the second row and the third row represent the calculation process of Key and Value, respectively
  • the last row represents the obtained first hash table.
  • In the first hash table, for example, when Key is 0 the corresponding Value is Value0, when Key is 1 the corresponding Value is Value1, and so on.
  • the first number is n
  • the first address of the reference block is head, that is, the address where the data a in FIG. 2 is located.
  • sampling is used to select the reference data block.
  • the first step is expressed by step.
  • the corresponding address can be head+n*step.
  • the first hash table is obtained.
  • p is the position in the reference block of the reference data block from which the Key corresponding to the Value is calculated (for example, the first address of that reference data block).
  • The embodiment of the present invention provides a first hash table. Instead of matching a large number of sub-target blocks against the reference block character by character, the method first looks up the first hash table to find a potential matching position of the sub-target block in the reference block, and then backward-matches the sub-target block and the target data located after it in the target block against the reference data at the potential matching position and the other reference data located after it in the reference block. This shortens the matching process, and system performance is improved by orders of magnitude.
  • In addition, the sampling technique can be used when constructing the first hash table, which improves system performance several times over, depending on the sampling step size; the matches missed because of sampling can be compensated for by the backward matching in the matching process.
  • the first hash table may be updated.
  • Key1 is calculated according to reference data block 1, and the value corresponding to reference data block 1 is Value1, and Key1 and Value1 are added to the first hash table.
  • Key1 is also calculated for reference data block 2, but the Value corresponding to reference data block 2 is Value2.
  • If Value1 is updated with Value2, then subsequent searches of the first hash table for matching will find the matching data closer to the first address of the reference block; if Value1 is not updated with Value2, subsequent searches of the first hash table will find the matching data farther from the endpoint.
  • Backward-matching the first sub-target block and the target data located after it in the target block against the first reference data and the other reference data located after the first reference data in the reference block can make up, as far as possible, for the loss caused by the sampling technique used when building the first hash table.
  • Before the first coding sequence is generated, the method further includes:
  • the target data in the target block before the first sub-target block is matched with other reference data in the reference block before the first reference data.
  • In addition, the first sub-target block and the target data located before it in the target block may be forward-matched against the first reference data and the other reference data located before the first reference data in the reference block, until no further match is possible. In this case, the first coding sequence is generated based on both the result of the backward matching and the result of the forward matching.
  • Performing forward matching as well lets more data be matched at once, which reduces the number of subsequent matching operations.
  • the data included in the target block is ABCDEFGHIJKLMNOPQRST, and n is set to 4.
  • A hash operation is performed on HIJK to obtain Key1, Key1 is found in the first hash table, the Value corresponding to Key1 in the first hash table is Value1, and the corresponding reference data (for example, reference data 1) is found in the reference block according to Value1.
  • Starting from reference data 1, HIJK and the other target data located after K in the target block are matched successively against the reference data in the reference block, that is, backward matching.
  • Suppose the target data successfully matched in the backward matching process is HIJKL, that is, the target data M is not matched successfully.
  • Then, starting from reference data 1 in the reference block, HIJK and the other target data located before H in the target block may be matched successively against the reference data in the reference block, that is, forward matching.
  • Suppose the target data successfully matched in the forward matching process is FG (not counting HIJK), that is, the target data E is not matched successfully.
  • The final matching result, for example FGHIJKL, is thus obtained, and the coding sequence can be generated according to FGHIJKL, which is the second matching result referred to in this application.
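  • The HIJK example can be reproduced with a short sketch. The function name `extend_match`, the parameter `t_floor`, and the reference block contents are hypothetical (the embodiment does not list the reference data for this example). Following the document's terminology, backward matching extends toward later data and forward matching extends toward earlier data.

```python
def extend_match(target, t_pos, source, s_pos, n, t_floor=0):
    """Grow a seed match of n data found via the hash table.

    Returns (target_start, source_start, length); length is 0 if the seed
    itself does not match (for example, after a hash collision).
    t_floor is the first target position not yet covered by earlier sequences.
    """
    if target[t_pos:t_pos + n] != source[s_pos:s_pos + n]:
        return t_pos, s_pos, 0
    # Backward matching: continue past the seed toward later data.
    end_t, end_s = t_pos + n, s_pos + n
    while end_t < len(target) and end_s < len(source) and target[end_t] == source[end_s]:
        end_t += 1
        end_s += 1
    # Forward matching: continue before the seed toward earlier data.
    start_t, start_s = t_pos, s_pos
    while start_t > t_floor and start_s > 0 and target[start_t - 1] == source[start_s - 1]:
        start_t -= 1
        start_s -= 1
    return start_t, start_s, end_t - start_t

target = "ABCDEFGHIJKLMNOPQRST"
reference = "XYFGHIJKLZW"            # hypothetical reference block
start_t, start_s, length = extend_match(target, target.index("HIJK"),
                                        reference, reference.index("HIJK"), 4)
print(target[start_t:start_t + length])  # FGHIJKL
```

  • Because the two loops are independent, they can run in either order and yield the same result.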
  • execution order of forward matching and backward matching can be arbitrary
  • The first coding sequence further includes target data that is not successfully matched. The unmatched target data is the target data located between the last matched target data corresponding to the previous coding sequence and the first target data that is successfully matched this time.
  • In other words, the unmatched target data is the target data between the last target data matched by forward matching and the last target data corresponding to the previous coding sequence adjacent to the first coding sequence.
  • The first coding sequence also includes a Lit bit, which can be used to store the data that is not successfully matched.
  • the target data included in the target block is ABCDEFGHIJKLMNOPQRST, and n is set to 4.
  • the final matching result obtained by forward matching and backward matching is FGHIJKL.
  • the coding sequence 1 is generated according to FGHIJKL.
  • the coding sequence 1 may include: the matching length is 7, and the offset is the distance between the reference data 1 and the first address of the reference block.
  • The coding sequence 1 may also include a Lit bit for storing unmatched target data. The unmatched target data in coding sequence 1 is E, so in the decoding process ABCD in the target block can be recovered from the previous adjacent coding sequence, and EFGHIJKL in the target block can be recovered from the present coding sequence.
  • a second hash table may be constructed, where the second hash table is a hash table corresponding to the target block.
  • the method of constructing the second hash table may refer to the manner of constructing the first hash table.
  • The hash algorithm used may be the same as the hash algorithm of the first hash table, and in the second hash table the amount of target data participating in each hash operation may also be n.
  • When the first hash table and the second hash table are queried with the key value obtained by hashing the sub-target block, there are the following cases:
  • 1. A hash value corresponding to the key value is found in both the first hash table and the second hash table.
  • 2. No hash value corresponding to the key value is found in the first hash table, but one is found in the second hash table.
  • 3. No hash value corresponding to the key value is found in either the first hash table or the second hash table.
  • 4. A hash value corresponding to the key value is found in the first hash table, but none is found in the second hash table.
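  • The four lookup outcomes can be expressed as a small dispatch function. This is only a sketch: `classify_lookup` is a hypothetical name, and the two tables are modeled as Python dictionaries from Key to Value.

```python
def classify_lookup(key, first_table: dict, second_table: dict) -> int:
    """Return the case number (1 to 4) for a key, using the numbering in the text."""
    in_first = key in first_table
    in_second = key in second_table
    if in_first and in_second:
        return 1  # found in both tables: match against both, keep the longer result
    if in_second:
        return 2  # found only in the second table: match within the target block
    if not in_first:
        return 3  # found in neither table: record the key in the second table
    return 4      # found only in the first table: match against the reference block

first_table = {10: 0, 11: 4}
second_table = {11: 8, 12: 2}
print([classify_lookup(k, first_table, second_table) for k in (11, 12, 13, 10)])  # [1, 2, 3, 4]
```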
  • The first case:
  • the method further includes:
  • The hash value corresponding to the key value in the second hash table indicates the address of target data in the target block. First target data is obtained according to the second hash value; the first datum of the first sub-target block is matched with the first target data, and second target data in the target block is matched with third target data in the target block, where the second target data is the other target data located after the first datum of the first sub-target block, and the third target data is the other target data located after the first target data in the target block. The target data located before the first sub-target block in the target block is also matched with the other target data located before the first target data in the target block, to obtain the first matching result;
  • the first coding sequence further includes an indication bit. When the amount of matched target data in the first matching result is greater than that in the second matching result, the indication bit indicates that the data matching the target data successfully matched this time is located in the target block; when the amount of matched target data in the second matching result is greater than that in the first matching result, the indication bit indicates that the data matching the target data successfully matched this time is located in the reference block.
  • The second hash table may also take the Key-Value form. For example, hashing the data in the first sub-target block yields a Key; in the second hash table, Key and Value are in one-to-one correspondence, so the corresponding Value can be queried in the second hash table according to the obtained Key.
  • The Value may be used to indicate the distance between the first address of the data in the sub-target block corresponding to the Value and the first address of a specific target block. The specific target block may be the first of all the target blocks, or may be another target block.
  • the second hash table may include both a Key and a Value, and the Key and the Value have a one-to-one correspondence.
  • the process of searching the second hash table may be: performing a hash operation according to the target data in the first sub-target block to obtain a Key (for example, referred to as a first key), and searching for the same key in the second hash table. If the same Key (for example, the second Key) is found in the second hash table, the second key in the second hash table corresponds to a Value (for example, referred to as the first Value), and then the first Value can be considered as the hash value corresponding to the first key.
  • If the corresponding hash value is found in both the first hash table and the second hash table, matching can be performed according to the first hash table and the second hash table respectively.
  • The first hash value corresponding to the first key value is found in the first hash table, and the first reference data in the reference block is determined according to the first hash value, which is equivalent to determining a potential matching position. Then the first sub-target block and the target data located after it in the target block are backward-matched against the first reference data and the other reference data located after the first reference data in the reference block, to compensate for the loss caused by sampling when the first hash table was built, until no further match is possible. The matching result obtained is called the first matching result.
  • In addition, the first sub-target block and the target data located before it in the target block may be forward-matched against the first reference data and the other reference data located before the first reference data in the reference block, until neither forward nor backward matching can continue.
  • The matching result obtained at this time can be regarded as the maximum matching result at the current matching point, and it can also be referred to as the first matching result.
  • The second hash value corresponding to the first key value is found in the second hash table, and the corresponding third target data in the target block is determined according to the second hash value. The first sub-target block and the target data located after it in the target block are backward-matched against the third target data and the other target data located after the third target data in the target block, until no further match is possible. The matching result obtained is referred to as the second matching result.
  • In addition, the first sub-target block and the target data located before it in the target block may be forward-matched against the third target data and the other target data located before the third target data in the target block, until neither forward nor backward matching can continue.
  • The matching result obtained at this time can be regarded as the maximum matching result at the current matching point, and it can also be referred to as the second matching result.
  • the process of matching in the second hash table is similar to the process of matching in the first hash table as exemplified in the previous example, and no more examples are given here.
  • The first coding sequence may further include an indication bit, which indicates whether the data matching the target data corresponding to the first coding sequence is located in the reference block or in the target block, that is, whether the selected matching result was obtained according to the first hash table or according to the second hash table.
  • the offset in the first coding sequence is used to indicate the first reference data of the current matching success in the reference block.
  • The same hash value may appear in both the first hash table and the second hash table, so two matching results may be obtained when the tables are looked up and matched according to the operation result. In that case the matching result with the larger amount of data can be chosen for encoding, which improves the matching accuracy and increases the compression ratio.
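  • Choosing between the two matching results can be sketched as follows. The `Match` record and `choose_match` are hypothetical names; the tie-break in favor of the reference block is an arbitrary assumption, since the embodiment only says that the result with the larger amount of data is chosen.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Match:
    in_reference: bool  # True: matched data lies in the reference block; False: in the target block
    offset: int         # position of the first matched datum in that block
    length: int         # amount of matched data

def choose_match(reference_match: Optional[Match],
                 target_match: Optional[Match]) -> Optional[Match]:
    """Return the longer of the two candidate matches, or None if neither matched."""
    candidates = [m for m in (reference_match, target_match)
                  if m is not None and m.length > 0]
    if not candidates:
        return None
    # max() keeps the first maximal element, so ties go to the reference block (assumption).
    return max(candidates, key=lambda m: m.length)

best = choose_match(Match(True, 2, 7), Match(False, 40, 9))
print(best.in_reference, best.length)  # False 9
```

  • The `in_reference` field of the chosen result is what the indication bit in the coding sequence would record.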
  • the method further includes:
  • the second hash table is updated such that the hash value corresponding to the first key value in the second hash table indicates the address of the first target data of the first sub-target block.
  • For example, there are two sub-target blocks in target block 1, namely sub-target block 1 and sub-target block 2.
  • The target data included in sub-target block 1 is ABCD.
  • The target data included in sub-target block 2 is also ABCD, that is, the two sub-target blocks contain the same data.
  • Sub-target block 1 is hashed, and Key1 is obtained.
  • The Value corresponding to Key1 is Value1, and Key1 and Value1 are stored in the second hash table.
  • Sub-target block 2 is hashed, and the obtained Key is also Key1. However, the positions of sub-target block 1 and sub-target block 2 in the target block are different, so their Values are different; for example, the Value corresponding to sub-target block 2 is Value2. Value1 in the second hash table can then be updated with Value2.
  • The hash value corresponding to the first key value stored in the second hash table may be updated according to the hash value corresponding to the first key value acquired this time. Here, updating means replacing the old hash value in the second hash table with the new hash value, so that the hash values in the second hash table are kept up to date and the success rate of the next match is improved.
  • the second hash table may or may not be updated.
  • In this way, the same hash value appears at most once in the first hash table and at most once in the second hash table, so the time performance is better, that is, the time needed for data compression can be shortened.
  • The second case:
  • the method further includes:
  • The hash value corresponding to the key value in the second hash table indicates the address of target data. First target data is obtained according to the hash value corresponding to the first key value in the second hash table; the first datum of the first sub-target block is matched with the first target data, and second target data in the target block is matched with third target data in the target block,
  • where the second target data is the other target data located after the first datum of the first sub-target block,
  • and the third target data is the other target data located after the first target data in the target block;
  • the first coding sequence further includes an indication bit for indicating that the data matching the target data that is successfully matched this time is located in the target block.
  • If the corresponding hash value is found in the second hash table but not in the first hash table, matching is performed according to the second hash table.
  • A hash value corresponding to the operation result (for example, a second hash value) is found in the second hash table, and the corresponding first target data in the target block is determined according to the second hash value, which is equivalent to determining a potential matching position. The first sub-target block and the target data located after it in the target block are backward-matched against the first target data and the other target data located after the first target data in the target block, and the first coding sequence can be generated according to the obtained matching result.
  • In addition, the first sub-target block and the target data located before it in the target block may be forward-matched against the first target data and the other target data located before the first target data in the target block, until neither forward nor backward matching can continue.
  • the matching result obtained at this time can be regarded as the maximum matching result of the current matching point, and the first coding sequence can be generated according to the matching result.
  • the method further includes:
  • the second hash table is updated such that the hash value corresponding to the first key value in the second hash table indicates the address of the first target data of the first sub-target block.
  • the second hash table may be updated, or the second hash table may not be updated.
  • In this way, the same hash value appears at most once in the first hash table and at most once in the second hash table, so the time performance is better, that is, the time needed for data compression can be shortened.
  • The third case:
  • the method further includes:
  • Generating a first coding sequence according to the obtained matching result including:
  • If the hash value corresponding to the first key value is not found in the second hash table, the second hash table is updated so that the hash value corresponding to the first key value in the second hash table indicates the address of the first target data of the first sub-target block.
  • the hash value corresponding to the first key value may be inserted into a corresponding position in the second hash table, that is, the second hash table is updated according to the hash value corresponding to the first key value, so that When the first key value occurs again next time, the corresponding hash value can be found in the second hash table.
  • The fourth case:
  • the method further includes:
  • If the hash value corresponding to the first key value is not found in the second hash table, the second hash table is updated so that the hash value corresponding to the first key value in the second hash table indicates the address of the first target data of the first sub-target block.
  • If the corresponding hash value is found in the first hash table but not in the second hash table, matching is performed according to the first hash table, and the first coding sequence is generated according to the matching result obtained. The matching process and the generation of the coding sequence can be as described above.
  • the hash value corresponding to the first key value may be inserted into a corresponding position in the second hash table, that is, the second hash table is updated according to the hash value corresponding to the first key value, so that When the first key value occurs again next time, the corresponding hash value can be found in the second hash table.
  • For example, a first hash table may be generated according to the reference block; for the specific generation process, refer to the foregoing part of the embodiment. Encoding of the target block then starts. Suppose the target block is ABCDEFGHIJKABCD, step is 1, and n is 4. First, the sub-target block ABCD in the target block is hashed to obtain the key value corresponding to ABCD, and the first hash table is queried. If the same key value is found in the first hash table, the coding sequence may be output as described in the foregoing part of the embodiment.
  • Otherwise, the key value corresponding to ABCD is stored in the second hash table, and the hash value corresponding to that key value in the second hash table is the address of the data A. Then, after the sub-target block BCDE is acquired, it is hashed to obtain the corresponding key value, the first hash table and the second hash table are queried, and the coding sequence is output according to the query result; for the specific processing, refer to the foregoing description of the four cases of querying the two hash tables.
  • Later, the sub-target block ABCD appears again, and the key value corresponding to it is still not found in the first hash table. However, because the key value and hash value corresponding to ABCD have already been stored in the second hash table, the corresponding hash value can be found in the second hash table when ABCD appears again and seeks a match.
  • Using the second hash table enables a sub-target block processed later in the target block to match a sub-target block that has already been processed in the target block, rather than only the reference data in the reference block, which further improves the efficiency of coding and the success rate of matching.
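  • The ABCDEFGHIJKABCD example can be traced with a short sketch of the second hash table alone. Several simplifications are assumptions made for readability: the raw sub-target block serves as the Key instead of a hash, the first hash table lookup is omitted (as if the reference block shared nothing with the target), and the window always advances by one datum even after a hit. `find_self_matches` is a hypothetical name.

```python
def find_self_matches(target: str, n: int = 4) -> list:
    """Return (current_position, earlier_position) pairs found via the second hash table."""
    second_table = {}   # Key -> Value, where Value is the distance from the start of the target block
    hits = []
    for pos in range(len(target) - n + 1):
        key = target[pos:pos + n]          # raw block as Key (assumption; a real Key is a hash)
        if key in second_table:
            hits.append((pos, second_table[key]))
        second_table[key] = pos            # insert a new Key, or update the stored Value
    return hits

print(find_self_matches("ABCDEFGHIJKABCD"))  # [(11, 0)]
```

  • The single hit shows the second occurrence of ABCD (position 11) being matched to the first occurrence (position 0) without any help from the reference block.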
  • a coding sequence may be generated according to the matching result, and the coding sequence may be output to complete the coding process.
  • FIG. 3 is a possible schematic diagram of a coding sequence provided in an embodiment of the present invention.
  • Head denotes the sequence header, which is the start of the entire coding sequence.
  • LLen denotes the character length; it is an optional field in the coding sequence that supplements the character length in the Head when that field is full.
  • Lit denotes the characters; it is an optional field in the coding sequence and is used to store unmatched characters in the target block, that is, unmatched target data.
  • Off denotes the offset; it is a field that must exist in the coding sequence, and Off may include the indication bit and the offset described above.
  • MLen denotes the matching length; it is an optional field in the coding sequence that supplements the matching length in the Head when that field is full.
  • the Head may be, for example, a fixed length, for example, the length may be 1 Byte.
  • The high h1 bits of the Head can be used to store the character length.
  • The high h1 bits of the Head are stored, for example, as follows.
  • h1 is 16:
  • The character length refers to the length of the target data left unmatched between the end of the previous match and the start of the current match. 15+LLen means that when the character length reaches 15, new space is opened to store the additional character length; the part of the character length beyond 15 is then expressed by LLen.
  • The low h2 bits of the Head can be used to store the matching length.
  • the low h2 bits in the Head are stored as follows.
  • h2 is 16:
  • The matching length refers to the length of the consecutively matched bytes, that is, the length of the data matched this time.
  • 15+MLen means that when the matching length reaches 15, new space is opened to store the additional matching length; the part of the matching length beyond 15 is then expressed by MLen.
  • h1 and h2 are only examples, and can be allocated as needed.
  • The space for storing the character length and the matching length is variable, that is, new space can be added according to the actual length. For example, if the character length is short, only the space in the Head is occupied, so less storage is used and the data compression ratio increases; if the character length is long, additional space can be added, so a variety of needs can be met.
  • LLen is used to store the character length.
  • the length of LLen is variable, for example, the variable range is [0, n] Byte.
  • For example, when the character length field in the Head is full, 1 Byte is allocated as LLen to store the character length.
  • If that Byte is also full (its value is 255), another 1 Byte is allocated as LLen to store the character length, and so on until the value of the last Byte allocated as LLen is less than 255.
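  • The variable-length storage of a length can be sketched as follows. The sketch assumes h1 = h2 = 4, which is what a 1-Byte Head and the threshold of 15 suggest, and it assumes that the extension Bytes add up to the part of the length beyond 15; `encode_length`, `decode_length`, and `pack_head` are hypothetical names. The same routine would serve LLen and MLen.

```python
def encode_length(length: int):
    """Split a length into a 4-bit Head field and extension Bytes (LLen or MLen)."""
    if length < 15:
        return length, b""          # fits in the Head, no extension needed
    rest = length - 15              # part of the length beyond 15
    ext = bytearray()
    while rest >= 255:              # a Byte equal to 255 means another Byte follows
        ext.append(255)
        rest -= 255
    ext.append(rest)                # the last Byte is always less than 255
    return 15, bytes(ext)

def decode_length(field: int, ext: bytes) -> int:
    """Inverse of encode_length."""
    return field if field < 15 else 15 + sum(ext)

def pack_head(char_len: int, match_len: int):
    """Return (Head byte, LLen bytes, MLen bytes) with the character length in the high bits."""
    high, llen = encode_length(char_len)
    low, mlen = encode_length(match_len)
    return bytes([(high << 4) | low]), llen, mlen

print(encode_length(7), encode_length(15), encode_length(300))
print(pack_head(2, 6))
```

  • Short lengths therefore cost no extra Bytes, while arbitrarily long lengths remain representable.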
  • the length of Lit is, for example, variable, and the variable range is, for example, [0, n] Byte.
  • Lit is used to store the original data of the target block, that is, the target data in the target block that does not match successfully. For example, if the character QW in the target block does not match successfully, the QW can be stored in Lit.
  • The length of Lit can be, for example, the sum of the value stored in the high h1 bits of the Head and the value stored in LLen.
  • the length of Off is, for example, a fixed length, and the length is, for example, 2 Bytes.
  • the upper 1 bit of Off can be used to store the type of matching block, that is, whether the data segment that successfully matches a target block is the data segment in the reference block or the data segment in the target block, for example:
  • the lower 15 bits of Off can be used to store the offset, for example:
  • When the type indicates the target block, the offset is used to indicate the location in the target block of the target data that matches the target data successfully matched this time;
  • when the type indicates the reference block, the offset is used to indicate the location in the reference block of the reference data that matches the target data successfully matched this time.
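  • Packing the indication bit and the 15-bit offset into the 2-Byte Off field can be sketched as follows. The function names are hypothetical, the big-endian byte order is an assumption, and the convention that an indication bit of 1 denotes the reference block follows the example value used in the decoding description.

```python
import struct

def pack_off(in_reference: bool, offset: int) -> bytes:
    """Pack the indication bit (upper 1 bit) and the offset (lower 15 bits) into 2 Bytes."""
    if not 0 <= offset < (1 << 15):
        raise ValueError("offset must fit in 15 bits")
    word = (int(in_reference) << 15) | offset
    return struct.pack(">H", word)       # big-endian unsigned 16-bit (assumption)

def unpack_off(data: bytes):
    """Return (in_reference, offset) from a 2-Byte Off field."""
    (word,) = struct.unpack(">H", data)
    return bool(word >> 15), word & 0x7FFF

print(pack_off(True, 8).hex(), unpack_off(pack_off(False, 300)))  # 8008 (False, 300)
```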
  • the length of MLen is, for example, variable, and the variable range is, for example, [0, n] Byte.
  • For example, 1 Byte can be allocated as MLen to store the matching length.
  • If needed, another 1 Byte is allocated as MLen to store the matching length, and so on until the value of the last Byte allocated as MLen is less than 255.
  • the coding sequence provided in the embodiment of the present invention is applied to the fields of coding (for example, may be Delta coding), and has the advantages of high compression ratio and good performance.
  • For example, when the first coding sequence is decoded, whether the successfully matched data is in the reference block or in the target block may be determined according to the indication bit in the Off of the first coding sequence. For example, if the value of the indication bit in Off is 1, and an indication bit of 1 denotes the reference block, it can be determined that the data successfully matched for the first coding sequence is located in the reference block.
  • According to the offset in Off, the position in the reference block of the first reference data successfully matched for the first coding sequence can be known. For example, the reference data included in the reference block is ABCDEFGHIJKLMNOPQRST, and the reference data indicated by the offset in Off is the reference data I.
  • If the length of the successfully matched data is 6 and the length of the unmatched data is 2, it can be learned from the reference block that the successfully matched data is IJKLMN. The unmatched data, for example WK, can be obtained from the Lit of the first coding sequence, so the target data corresponding to the first coding sequence is WKIJKLMN.
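  • The decoding example can be traced with a minimal sketch. `decode_sequence` is a hypothetical name, the fields are passed in already parsed, and the offset is taken as a zero-based position from the first address of the block, in line with Value = p - head.

```python
def decode_sequence(literals: str, in_reference: bool, offset: int, match_len: int,
                    reference: str, decoded_so_far: str = "") -> str:
    """Return the target data restored by one coding sequence.

    literals     : unmatched target data stored in Lit
    in_reference : indication bit (True: copy from the reference block,
                   False: copy from the target data decoded so far)
    offset       : position of the first matched datum in the chosen block
    match_len    : amount of matched data
    """
    out = list(decoded_so_far + literals)
    for i in range(match_len):
        source = reference if in_reference else out
        out.append(source[offset + i])   # copying datum by datum keeps self-references valid
    return "".join(out[len(decoded_so_far):])

reference = "ABCDEFGHIJKLMNOPQRST"
print(decode_sequence("WK", True, reference.index("I"), 6, reference))  # WKIJKLMN
```

  • With the indication bit cleared, the same routine copies from target data that earlier coding sequences have already restored.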
  • FIG. 4 is a flowchart of an encoding method according to an embodiment of the present invention.
  • Step 1 Obtain a first sub-target block.
  • the first sub-target block may be, for example, any one of the 10 target blocks, for example, n-bit target data is included in the first sub-target block.
  • Step 2 Perform a hash operation on the first sub-target block to obtain key value 1, and query the first hash table and the second hash table respectively according to key value 1. If hash value 1 corresponding to key value 1 is found in both the first hash table and the second hash table, step 3 is performed. If hash value 1 is found in the first hash table but not in the second hash table, step 7 is performed. If hash value 1 is not found in the first hash table but is found in the second hash table, step 8 is performed. If hash value 1 is found in neither the first hash table nor the second hash table, step 9 is performed.
  • Step 3 Match the first sub-object block according to the first hash value queried in the first hash table and the second hash value queried in the second hash table, to obtain two matching results.
  • matching is performed, only backward matching may be performed, or both forward matching and backward matching may be performed.
  • the second hash value in the second hash table may also be updated according to the hash value corresponding to the first key value.
  • the matching process can be referred to the previous description.
  • Step 4 Select a matching result with a large amount of data from the two matching results, for example, the matching result 1 is selected.
  • Step 5 Generate and output a code sequence 1 according to the determined matching result.
  • Code sequence 1 package For the content and format, etc., refer to the related description in the flow of FIG. 1.
  • the determined matching result may be the matching result 1, or may be the matching result 2 to be introduced later. Go to step 6.
  • Step 6 Determine whether there are still sub-target blocks to be matched. If there are still sub-target blocks to be matched, step 1 is performed, and if there is no sub-target block to be matched, the process ends.
  • Step 7 Match the first sub-object block according to the first hash value queried in the first hash table to obtain a matching result, for example, a matching result 1. Go to step 5.
  • the matching result may be a matching result obtained by a backward matching, or may also be a matching result obtained by a forward matching plus a backward matching.
  • the second hash value in the second hash table may also be updated according to the hash value corresponding to the first key value.
  • Step 8 Match the first sub-object block according to the second hash value queried in the second hash table to obtain a matching result, for example, a matching result 2. Go to step 5.
  • the matching result may be a matching result obtained by a backward matching, or may also be a matching result obtained by a forward matching plus a backward matching.
  • the second hash value in the second hash table may also be updated according to the hash value corresponding to the first key value.
  • Step 9 Update the second hash value in the second hash table according to the hash value corresponding to the first key value.
  • the matching process involved in the process of FIG. 4, the process of generating a code sequence, the information included in the code sequence, and other contents not described in detail may refer to the related description in the flow of FIG.
  • Referring to FIG. 5, based on the same inventive concept and the foregoing embodiments, an embodiment of the present invention provides an encoding apparatus. The encoding apparatus may include an obtaining module 501 and a processing module 502, and may be used to implement the methods described in the flows of FIG. 1 and FIG. 2.
  • For example, the obtaining module 501 may be used to obtain the first sub-target block, and the other steps in the flows of FIG. 1 and FIG. 2 may be performed by the processing module 502. Because each step has been described in the foregoing embodiments, details are not repeated here; for the functions of the obtaining module 501 and the processing module 502 and the corresponding execution processes, refer to the description of the foregoing embodiments.
  • Referring to FIG. 6, based on the same inventive concept and the foregoing embodiments, an embodiment of the present invention provides an encoding apparatus that may include a memory 601 and a processor 602. This encoding apparatus and the encoding apparatus in FIG. 5 may be the same device.
  • The processor 602 may be, for example, a CPU (Central Processing Unit) or an ASIC (Application Specific Integrated Circuit), may be one or more integrated circuits for controlling program execution, may be a hardware circuit developed using an FPGA (Field Programmable Gate Array), or may be a baseband chip.
  • There may be one or more memories 601.
  • The memory 601 may include a ROM (Read Only Memory), a RAM (Random Access Memory), and disk storage.
  • The memory 601 may be connected to the processor 602 through a bus (FIG. 6 uses this as an example), or through a dedicated connection line.
  • By designing and programming the processor 602, the code corresponding to the foregoing method is burned into the chip, so that the chip can perform the method shown in the foregoing embodiments when it runs.
  • How to design and program the processor 602 is well known to those skilled in the art, and details are not described here.
  • For example, the memory 601 may store the instructions that the processor 602 needs to perform its tasks. By executing the instructions stored in the memory 601, the processor 602 can implement the obtaining module 501 and the processing module 502 in FIG. 5; that is, the steps in the flows of FIG. 1 and FIG. 2 may all be performed by the processor 602. Because each step has been described in the foregoing embodiments, for the functions of the processor 602 and the corresponding execution processes, refer to the description of the foregoing embodiments.
  • In this embodiment of the present invention, when a sub-target block of the target block is obtained (for example, called the first sub-target block), a hash operation is first performed on it, and the corresponding hash value is then looked up in the first hash table according to the result.
  • The corresponding position in the reference block, that is, the first reference data, is then located from the hash value found, so that the first sub-target block can be matched backward from that position. Determining an approximate position in advance narrows the range that must be matched, greatly reduces the system workload, saves compression time, and improves compression efficiency and system performance.
  • In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners.
  • For example, the apparatus embodiments described above are merely illustrative.
  • For example, the division into units is only a division by logical function.
  • In actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
  • In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The integrated unit may be implemented in hardware or as a software functional unit.
  • If the integrated unit is implemented as a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of this application in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied as a software product that is stored in a storage medium and includes a number of instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of this application.
  • The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.

Abstract

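An encoding method and apparatus are provided to address the low data compression efficiency of Delta encoding. In the method, when a first sub-target block of a target block is obtained, a hash operation is performed on it, the corresponding hash value is looked up in a first hash table according to the result, and the corresponding position in a reference block, that is, the first reference data, is located from that hash value. The leading target data of the first sub-target block is then matched against the first reference data, and second target data in the target block is matched against second reference data in the reference block. Determining an approximate position in advance narrows the range that must be matched, which saves compression time and improves compression efficiency.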

Description

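An Encoding Method and Apparatus
Technical Field
The present invention relates to the field of data compression technologies, and in particular to an encoding method and apparatus.
Background
The Delta algorithm is a lossless data compression technique that computes a Delta encoding between a new file and a reference file already stored in the system. For example, when a new file needs to be stored, it is compared with each of several reference files already in the system. If the similarity between the new file and one of the reference files exceeds a preset threshold, a Delta encoding corresponding to the new file is computed, and only this Delta encoding needs to be stored rather than the new file itself. To restore the new file, the reference file similar to it and its Delta encoding are used together. Compressing similar files with Delta encoding in this way can save a great deal of storage space.
XDelta encoding is currently a commonly used Delta encoding algorithm. Its core idea is to search the reference block for sub-blocks that match the target block; typically, a match is considered successful when 3 or 4 consecutive bytes are identical.
However, XDelta matches byte by byte. If the reference block is large, the matching workload is heavy and time-consuming, so data compression efficiency is low.
Summary
This application provides an encoding method and apparatus to address the low data compression efficiency of Delta encoding.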
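According to a first aspect, an encoding method is provided, including:
obtaining a first sub-target block, where the first sub-target block belongs to a target block;
performing a hash operation on the first sub-target block to obtain a first key value, and querying a first hash table according to the first key value, where the hash value corresponding to a key value in the first hash table indicates the address of reference data in a reference block; if a first hash value corresponding to the first key value is found in the first hash table, obtaining, according to the first hash value, the first reference data at the address indicated by the first hash value, matching the leading target data of the first sub-target block against the first reference data, and matching second target data in the target block against second reference data in the reference block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the second reference data is the other reference data located after the first reference data in the reference block; and
generating a first coding sequence according to the matching result of the leading target data and the second target data, where the first coding sequence includes a match length and an offset, the match length indicates the length of the target data matched successfully this time, and the offset indicates the position of the data that matches that target data.
In this application, when a sub-target block of the target block is obtained (for example, called the first sub-target block), a hash operation is first performed on it, the corresponding hash value is looked up in the first hash table according to the result, and the corresponding position in the reference block, that is, the first reference data, is located from that hash value. The first sub-target block can then be matched backward from that position (that is, the leading target data of the first sub-target block is matched against the first reference data, and the second target data in the target block is matched against the second reference data in the reference block). Determining an approximate position in advance narrows the range that must be matched, which saves compression time and improves compression efficiency.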
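With reference to the first aspect, in a first possible implementation of the first aspect, before the querying a first hash table according to the first key value, the method further includes:
obtaining reference data blocks from the reference block according to a first step size, where each reference data block includes n items of reference data, the first sub-target block includes n items of target data, and n is a positive integer; and
constructing the first hash table, where the key values of the first hash table are obtained by applying the hash operation to the reference data blocks.
That is, the first hash table can be built in advance from the known reference data in the reference block for later use. In this application, the amount of reference data fed into each hash operation when building the first hash table must equal the amount of template data fed into each operation when matching the target block (that is, the amount of data in one sub-target block); otherwise the results could differ in length and could not be matched.
With reference to the first aspect or its first possible implementation, in a second possible implementation of the first aspect, before the generating a first coding sequence, the method further includes: matching the target data located before the first sub-target block in the target block against the other reference data located before the first reference data in the reference block.
That is, in addition to the backward matching mentioned above, the first sub-target block and the target data before it can be forward-matched against the first reference data and the reference data before it, until no further match is possible. This matches as much target data as possible in one pass, which reduces the number of matching operations and the number of coding sequences generated later, and lightens the load on the system.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the first coding sequence further includes unmatched target data, where the unmatched target data is the target data between the last matched target data of the previous coding sequence and the first target data matched this time.
Matching may leave some data unmatched. To restore the original target data as accurately as possible during decoding, the unmatched data must also be recorded. The coding sequence in this application therefore provides a field for storing unmatched target data, so that the decoding result is as consistent as possible with the original target data.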
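With reference to the first aspect or any one of its first to third possible implementations, in a fourth possible implementation of the first aspect,
after the performing the hash operation on the first sub-target block to obtain a first key value, the method further includes:
querying a second hash table according to the first key value, where the hash value corresponding to a key value in the second hash table indicates the address of target data in the target block; if a second hash value corresponding to the first key value is found in the second hash table, obtaining first target data according to the second hash value, matching the leading target data of the first sub-target block against the first target data, and matching second target data in the target block against third target data in the target block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the third target data is the other target data located after the first target data in the target block; and matching the target data located before the first sub-target block in the target block against the other target data located before the first target data in the target block, to obtain a first matching result;
the result of matching the leading target data of the first sub-target block against the first reference data and matching the second target data in the target block against the second reference data in the reference block is a second matching result; and the generating a first coding sequence includes:
selecting, from the first matching result and the second matching result, the one in which more target data is matched, and generating the first coding sequence according to the selected matching result, where the first coding sequence further includes an indication bit that, when more target data is matched in the first matching result than in the second, indicates that the data matching the target data matched this time is located in the target block, or, when more target data is matched in the second matching result than in the first, indicates that the data is located in the reference block.
That is, this application provides a first hash table and a second hash table, and a key value can be looked up and matched in both. The same hash value may appear in both tables, so the lookup and matching may yield two matching results. The result covering more data can then be chosen for encoding, which improves matching accuracy and the compression ratio.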
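With reference to the first aspect or any one of its first to third possible implementations, in a fifth possible implementation of the first aspect, if no hash value corresponding to the first key value is found in the first hash table, before the generating a first coding sequence, the method further includes:
querying a second hash table according to the first key value;
if a hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data, obtaining first target data according to the hash value corresponding to the first key value in the second hash table, matching the leading target data of the first sub-target block against the first target data, and matching second target data in the target block against third target data in the target block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the third target data is the other target data located after the first target data in the target block; and
the first coding sequence further includes an indication bit indicating that the data matching the target data matched this time is located in the target block.
This application provides a first hash table and a second hash table, and a key value can be looked up and matched in both. If a hash value appears in only one of the tables, the lookup and matching yield only one matching result, which can be encoded directly. For this reason the coding sequence in this application provides an indication bit that shows whether the data matching the target data matched this time is located in the target block or the reference block. During decoding, the location of that data can then be found accurately, which improves decoding accuracy. Moreover, because the indication bit carries the location directly in the coding sequence, no extra storage is needed to record it, which saves storage space and makes decoding more efficient.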
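With reference to the fourth or fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the method further includes:
updating the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
That is, the hash value stored for the first key value in the second hash table can be updated with the hash value obtained for the first key value this time. Updating here can mean replacing the old hash value in the second hash table with the new one, which keeps the second hash table current and raises the success rate of the next match.
With reference to the first aspect or any one of its first to third possible implementations, in a seventh possible implementation of the first aspect,
after the querying a first hash table according to the first key value, if no hash value corresponding to the first key value is found in the first hash table, a second hash table is queried according to the first key value; and
if no hash value corresponding to the first key value is found in the second hash table, the second hash table is updated so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
With reference to the first aspect or any one of its first to third possible implementations, in an eighth possible implementation of the first aspect,
after the performing a hash operation on the first sub-target block to obtain a first key value, a second hash table is queried according to the first key value; and
if no hash value corresponding to the first key value is found in the second hash table, the second hash table is updated so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
That is, if no hash value corresponding to the first key value is found in the second hash table, the hash value corresponding to the first key value can be inserted at the corresponding position in the second hash table. The next time the first key value occurs, the corresponding hash value can then be found in the second hash table.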
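According to a second aspect, an encoding apparatus is provided, including:
a memory, configured to store instructions; and
a processor, configured to perform the following by executing the instructions:
obtaining a first sub-target block, where the first sub-target block belongs to a target block;
performing a hash operation on the first sub-target block to obtain a first key value, and querying a first hash table according to the first key value, where the hash value corresponding to a key value in the first hash table indicates the address of reference data in a reference block; if a first hash value corresponding to the first key value is found in the first hash table, obtaining, according to the first hash value, the first reference data at the address indicated by the first hash value, matching the leading target data of the first sub-target block against the first reference data, and matching second target data in the target block against second reference data in the reference block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the second reference data is the other reference data located after the first reference data in the reference block; and
generating a first coding sequence according to the matching result of the leading target data and the second target data, where the first coding sequence includes a match length and an offset, the match length indicates the length of the target data matched successfully this time, and the offset indicates the position of the data that matches that target data.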
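With reference to the second aspect, in a first possible implementation of the second aspect, the processor is further configured to:
before the querying a first hash table according to the first key value, obtain reference data blocks from the reference block according to a first step size, where each reference data block includes n items of reference data, the first sub-target block includes n items of target data, and n is a positive integer; and
construct the first hash table, where the key values of the first hash table are obtained by applying the hash operation to the reference data blocks.
With reference to the second aspect or its first possible implementation, in a second possible implementation of the second aspect, the processor is further configured to:
before the generating a first coding sequence, match the target data located before the first sub-target block in the target block against the other reference data located before the first reference data in the reference block.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the first coding sequence further includes unmatched target data, where the unmatched target data is the target data between the last matched target data of the previous coding sequence and the first target data matched this time.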
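With reference to the second aspect or any one of its first to third possible implementations, in a fourth possible implementation of the second aspect, the processor is further configured to:
after the performing the hash operation on the first sub-target block to obtain a first key value, query a second hash table according to the first key value, where the hash value corresponding to a key value in the second hash table indicates the address of target data in the target block; if a second hash value corresponding to the first key value is found in the second hash table, obtain first target data according to the second hash value, match the leading target data of the first sub-target block against the first target data, and match second target data in the target block against third target data in the target block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the third target data is the other target data located after the first target data in the target block; and match the target data located before the first sub-target block in the target block against the other target data located before the first target data in the target block, to obtain a first matching result, where the result of matching the leading target data of the first sub-target block against the first reference data and matching the second target data in the target block against the second reference data in the reference block is a second matching result; and
select, from the first matching result and the second matching result, the one in which more target data is matched, and generate the first coding sequence according to the selected matching result, where the first coding sequence further includes an indication bit that, when more target data is matched in the first matching result than in the second, indicates that the data matching the target data matched this time is located in the target block, or, when more target data is matched in the second matching result than in the first, indicates that the data is located in the reference block.
With reference to the second aspect or any one of its first to third possible implementations, in a fifth possible implementation of the second aspect, the processor is further configured to:
if no hash value corresponding to the first key value is found in the first hash table, query a second hash table according to the first key value before the generating a first coding sequence;
if a hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data, obtain first target data according to the hash value corresponding to the first key value in the second hash table, match the leading target data of the first sub-target block against the first target data, and match second target data in the target block against third target data in the target block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the third target data is the other target data located after the first target data in the target block; and
the first coding sequence further includes an indication bit indicating that the data matching the target data matched this time is located in the target block.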
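With reference to the fourth or fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the processor is further configured to:
update the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
With reference to the second aspect or any one of its first to third possible implementations, in a seventh possible implementation of the second aspect, the processor is further configured to:
after the querying a first hash table according to the first key value, if no hash value corresponding to the first key value is found in the first hash table, query a second hash table according to the first key value; and
if no hash value corresponding to the first key value is found in the second hash table, update the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
With reference to the second aspect or any one of its first to third possible implementations, in an eighth possible implementation of the second aspect, the processor is further configured to:
after the performing a hash operation on the first sub-target block to obtain a first key value, query a second hash table according to the first key value; and
if no hash value corresponding to the first key value is found in the second hash table, update the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
According to a third aspect, an encoding apparatus is provided, including modules configured to perform the method of the first aspect.
Compared with the prior art, the solution proposed in this application completes the matching of sub-target blocks faster, which saves compression time and improves compression efficiency.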
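Brief Description of the Drawings
FIG. 1 is a flowchart of an encoding method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a process of constructing the first hash table according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a coding sequence according to an embodiment of the present invention;
FIG. 4 is another flowchart of an encoding method according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of an encoding apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an encoding apparatus according to an embodiment of the present invention.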
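Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
At present, XDelta encoding requires byte-by-byte matching. If the reference block is large, the matching workload is heavy and time-consuming, so data compression efficiency is low.
The present invention takes this problem into account. When a sub-target block is obtained, a hash operation is first performed on it, the corresponding hash value is looked up in the first hash table associated with the reference block, and the corresponding position in the reference block, that is, the corresponding reference data, is located from that hash value. The sub-target block can then be matched at that position, for example backward from the reference data found. Determining an approximate position in advance narrows the range that must be matched, greatly reduces the system workload, saves compression time, and improves compression efficiency and system performance.
The advantage of the embodiments of the present invention is most apparent when there are many target blocks to compress. For example, there are 50 target blocks of 8 Kbit each, 400 Kbit in total, and one reference block of 8 Kbit. The following describes how each of these target blocks is encoded. If all 50 target blocks are encoded with the method of the embodiments and the compression ratio is, for example, 50%, the result is 200 Kbit of coding sequences plus the 8 Kbit reference block, 208 Kbit in all, which is much smaller than 400 Kbit and saves storage space.
Throughout this specification, following the example above, a target block is a data block of 8 Kbit and a reference block is a reference block of 8 Kbit. The target block and the reference block are generally the same size, but in practice their sizes can be chosen freely; the values here are only examples. Target data is an item of data in the target block, reference data is an item of data in the reference block, and a sub-target block consists of n consecutive items of target data in the target block, where n is a positive integer.
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings.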
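Referring to FIG. 1, an embodiment of the present invention provides an encoding method whose flow is described as follows.
Step 101: Obtain a first sub-target block, where the first sub-target block belongs to a target block.
Step 102: Perform a hash operation on the first sub-target block to obtain a first key value, and query a first hash table according to the first key value, where the hash value corresponding to a key value in the first hash table indicates the address of reference data in a reference block. If a first hash value corresponding to the first key value is found in the first hash table, obtain, according to the first hash value, the first reference data at the address indicated by the first hash value, match the leading target data of the first sub-target block against the first reference data, and match second target data in the target block against second reference data in the reference block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the second reference data is the other reference data located after the first reference data in the reference block.
Step 103: Generate a first coding sequence according to the matching result of the leading target data and the second target data, where the first coding sequence includes a match length and an offset, the match length indicates the length of the target data matched successfully this time, and the offset indicates the position of the data that matches that target data.
In this embodiment of the present invention, the first sub-target block may include n consecutive items of data from any target block to be compressed, where n is a positive integer. For example, it may be specified in advance that each hash operation takes a first quantity of consecutive data items; the first sub-target block then includes that first quantity of consecutive data items and can be regarded as a data combination, or data segment, made up of them. That is, the first quantity is n.
In practice, the first sub-target block can be regarded as a run of consecutive data within a target block, that is, a sub-target block of the target block; an actual target block may contain more data (more than n items).
For example, there are 8 target blocks of 8 Kbit each and n is 4, so the first sub-target block includes 4 consecutive items of data from one of the target blocks.
The term first sub-target block in the embodiments of the present invention is therefore only a name and does not dictate the actual situation. In practice, the first sub-target block may be a complete target block, or it may be a sub-target block within a complete target block.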
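When the first sub-target block needs to be encoded, a hash operation can be performed on the data it contains using a hash algorithm. This may be a predetermined hash algorithm, and it is the same algorithm as the one used for the first hash table.
The hash table has, for example, a Key-Value form. Performing the hash operation on the data in the first sub-target block yields a Key. In the first hash table, Keys and Values correspond one to one, so the Value can be looked up from the Key obtained. When there is only one reference block, the Value can indicate the distance between the address of the first reference data corresponding to that Value and the first address of the reference block.
Because a hash value represents the distance between the addresses of two data items, a hash value can be regarded as corresponding to a data combination, such as a sub-target block or a reference data block (a reference data block includes n items of reference data), or as corresponding to a single data item, such as the first data item in a sub-target block or in a reference data block.
For example, Keys and Values in the first hash table correspond one to one. The lookup can proceed as follows: a hash operation on the data in the first sub-target block yields a first Key, and the first hash table is searched for a Key identical to the first Key. If the first Key is found in the first hash table, it corresponds there to a first Value, and this first Value can be regarded as the first hash value corresponding to the first Key of the first sub-target block.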
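For example, the first sub-target block in the target block contains the data BCFG, four consecutive data items. In this example each data item in the first sub-target block is a character; in practice each item may be a character, a digit, or another kind of data. Performing a hash operation on the first sub-target block yields a key value (Key), for example called the first key value, and the corresponding hash value (for example called the first hash value) can be looked up in the first hash table. That is, the hash operation on the first sub-target block yields a first Key, and the first hash table is searched for an identical Key. If a Key identical to the first Key (for example called the second Key) is found in the first hash table, the second Key corresponds to a Value (the first hash value). From this first hash value, the matching first reference data in the reference block can be determined, so that the leading target data of the first sub-target block can be matched against the first reference data and the second target data in the target block against the second reference data in the reference block. In other words, once the first reference data is determined, the first sub-target block and the target data after it can be matched backward against the first reference data and the reference data after it until no further match is possible, which gives the matching result for the first key value.
For example, the first sub-target block contains the data BCFG, four consecutive data items, and the reference block is AEBCFGHIJKLMN. Suppose the first reference data determined is the data B in the reference block. The first sub-target block is then matched, starting from its data B, against the data B in the reference block and the data after it. In this example all of BCFG matches. Because the solution performs backward matching, once the first sub-target block has matched, the target data after it can continue to be matched. For example, if the target block is BCFGHIJKPQA with BCFG as the first sub-target block, matching continues from data H until no further match is possible. The final matching result in this example is BCFGHIJK, which is the first matching result referred to in this application.
As another example, the first sub-target block contains the data BCFG, four consecutive data items, and the reference block is AEBCFQHIJKLMN. Suppose the first reference data determined is the data B in the reference block. The first sub-target block is then matched, starting from its data B, against the data B in the reference block and the data after it. In this example only BCF of BCFG matches, and matching fails from data G onward, so the target data after the first sub-target block cannot be matched either. The final matching result in this example is BCF, which is the first matching result referred to in this application.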
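The two backward-matching examples above can be reproduced with a short Python sketch. The function name `backward_match` and its signature are illustrative assumptions, not part of the patent; the candidate position (index 2, the B in the reference block) is taken as already found through the hash table.

```python
def backward_match(target, t_pos, reference, r_pos):
    """Count consecutive equal items starting at target[t_pos] and reference[r_pos].

    "Backward" follows the document's usage: matching proceeds toward later data.
    """
    length = 0
    while (t_pos + length < len(target)
           and r_pos + length < len(reference)
           and target[t_pos + length] == reference[r_pos + length]):
        length += 1
    return length


target = "BCFGHIJKPQA"

# First example: the reference block contains BCFGHIJK starting at index 2.
n1 = backward_match(target, 0, "AEBCFGHIJKLMN", 2)
first = target[:n1]

# Second example: the reference block differs at the fourth item (Q instead of G).
n2 = backward_match(target, 0, "AEBCFQHIJKLMN", 2)
second = target[:n2]
```

In the first case matching stops at P versus L; in the second it stops inside the sub-target block itself, at G versus Q.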
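The first reference data here may be the first data item in the reference data block that was hashed to produce the first hash value in the first hash table, that is, the data at the first address of that reference data block. Alternatively, the first reference data may refer to that reference data block as a whole.
The amount of data in a sub-target block, that is, the number of data items fed into each hash operation, is not limited by the present invention and can be set as required. The larger the sub-target block, the more data each hash operation covers, the smaller the system workload, and the shorter the matching time. The smaller the sub-target block, the less data each hash operation covers, the finer the matching process, and the more accurate the matching result.
In this embodiment of the present invention, if the first coding sequence is generated from the matching result obtained through the first hash table, the offset in the first coding sequence can indicate the position in the reference block of the reference data that matches the target data matched this time.
Optionally, before the first hash table is queried according to the first key value, the method further includes:
obtaining reference data blocks from the reference block according to a first step size, where each reference data block includes n items of reference data, the first sub-target block includes n items of target data, and n is a positive integer; and
constructing the first hash table, where the key values of the first hash table are obtained by applying the hash operation to the reference data blocks.
That is, the first hash table can be built in advance from the known reference data in the reference block, with each reference data block fed into the hash operation containing n items of reference data. In this embodiment of the present invention, the amount of reference data fed into each hash operation when building the first hash table must equal the amount of template data fed into each operation when matching the target block (that is, the amount of data in one sub-target block); otherwise the results could differ in length and could not be matched.
For example, FIG. 2 shows the process of constructing the first hash table. The first hash table is HashTable(Key) = Value. In this embodiment, the hash algorithm is, for example, a rule based on the golden-ratio prime: a Key is the hash value computed from a reference data block with the golden-ratio prime 2654435761U. Using the golden-ratio prime is one implementation choice; other primes, and indeed other hash algorithms, may also be used, and this is not limited in the embodiments of the present invention. A Value is the distance between the first address of a reference data block and the first address of the reference block, which amounts to the position of that reference data block within the reference block. The value of n can be set as required and is not limited by the present invention; for example, it can be set to 4.
In FIG. 2, Value = p - head, where p is the position in the reference block of the reference data block from which the Key corresponding to this Value was computed (for example, the first address of that reference data block), and head is the position of the first data item of the reference block (that is, the first address of the reference block). The difference between these two addresses is the distance between the first data item of that reference data block and the first data item of the reference block, and this distance is the Value.
In FIG. 2, Key = (*((unsigned int*)p) * 2654435761U) >> 19. Here unsigned int is a data type in the C language, and p is a pointer to the first address of the reference data block from which the Key corresponding to the Value is computed. This expression is the golden-ratio-prime operation.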
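To make the Key expression concrete, here is a minimal Python sketch of the same arithmetic. It rests on assumptions the text does not state: that `unsigned int` is 32 bits wide, that the four bytes are loaded in little-endian order, and that the multiplication wraps modulo 2^32 as unsigned C arithmetic does. The name `hash_key` is illustrative.

```python
import struct

GOLDEN_PRIME = 2654435761  # multiplier quoted in the text
SHIFT = 19                 # keeps the top 32 - 19 = 13 bits of the product


def hash_key(block):
    """Key for the first four bytes of block, mirroring
    (*((unsigned int*)p) * 2654435761U) >> 19 under the stated assumptions."""
    (word,) = struct.unpack_from("<I", block)  # assumed little-endian 32-bit load
    product = (word * GOLDEN_PRIME) & 0xFFFFFFFF  # emulate 32-bit unsigned wraparound
    return product >> SHIFT


key = hash_key(b"BCFG")
```

Under these assumptions every Key falls in the range 0 to 8191, so a table indexed by Key needs 8192 slots.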
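In general, a larger hash table raises the compression ratio somewhat but inevitably lowers performance, so an appropriately sized hash table is important. Regarding Value: in Delta block compression the block size is typically 4 Kbit or 8 Kbit, so a Value can be represented with 2 Bytes (allowing a maximum block of 64 Kbit), which is sufficient, byte-aligned, and free of redundant bytes. In addition, the Off (offset) field in the coding sequence described later uses 15 bits for the distance between the current data and the reference data, so a 2-Byte Value is also sufficient in that respect. Regarding Key: the L1 cache of a typical server CPU (central processing unit) is at least 32 Kbit, and the encoding scheme in the embodiments of the present invention uses two hash tables, so each hash table can be sized at 16 Kbit to keep speed high. The first hash table can therefore be designed with 8K keys, with Key taking values below 8K.
Optionally, in this embodiment, sampling can be used while computing hash values to fill the first hash table, which speeds up both construction and lookup. For example, during sampling the pointer p can be advanced by the first step size each time.
For example, the reference block used to build the first hash table contains the data WILMBCFGAB, and each hash operation takes 4 data items, that is, n is 4. Without sampling, the hash operation is applied to each of the reference data blocks WILM, ILMB, LMBC, MBCF, BCFG, CFGA, and FGAB. With sampling, for example selecting one reference data block after every 2 data items skipped, that is, step is 2, the reference data blocks that take part in the hash operation are WILM, MBCF, and FGAB. Sampling thus reduces the number of reference data blocks used to build the first hash table and lightens the system workload.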
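The sampling example can be checked with a small sketch. Note that the selected blocks WILM, MBCF, and FGAB start at offsets 0, 3, and 6, so the sketch takes the distance between selected starts as an explicit `stride` parameter (3 here, that is, two start positions skipped in between). The helper names are hypothetical, and the block itself stands in for its hash Key to keep the output readable.

```python
def sample_blocks(reference, n, stride):
    """List (start, block) pairs for every stride-th n-item window of the reference block."""
    return [(pos, reference[pos:pos + n])
            for pos in range(0, len(reference) - n + 1, stride)]


def build_first_table(reference, n, stride, key_of=lambda block: block):
    """Map Key -> Value, where Value is the block's distance from the reference block start.

    key_of defaults to the identity as a readable stand-in for the hash operation.
    """
    table = {}
    for pos, block in sample_blocks(reference, n, stride):
        table.setdefault(key_of(block), pos)  # keep the first Value seen for a Key
    return table


reference = "WILMBCFGAB"
all_blocks = [block for _, block in sample_blocks(reference, 4, 1)]
sampled = [block for _, block in sample_blocks(reference, 4, 3)]
table = build_first_table(reference, 4, 3)
```

Keeping the first Value for a repeated Key is one of the two policies the description allows; replacing `setdefault` with a plain assignment gives the overwrite policy.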
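Continuing with FIG. 2, the first row shows the data in the reference block, the second and third rows show the computation of Key and Value, and the last row shows the resulting first hash table, in which, for example, Key 0 corresponds to Value0, Key 1 corresponds to Value1, and so on. For example, the first quantity is n and the first address of the reference block is head, the address of data a in FIG. 2. FIG. 2 selects reference data blocks by sampling, with the first step size denoted step, so that, for example, the address corresponding to data x can be head + n*step. The first hash table is obtained by sampling reference data blocks from the first row and computing on them. Here p is the position in the reference block of the reference data block from which the Key corresponding to a Value was computed (for example, the first address of that reference data block).
The embodiments of the present invention provide the first hash table so that the large amount of character-by-character matching between sub-target blocks and the reference block is replaced by a lookup in the first hash table. Once the potential matching position of a sub-target block in the reference block is found, the sub-target block and the target data after it are matched backward against the reference data at that position and the reference data after it. This eliminates much of the matching work and improves system performance by orders of magnitude.
In addition, sampling can be used when building the first hash table, which improves system performance several-fold depending on the sampling step, while the matches missed because of sampling can be compensated for by backward matching during the matching process.
Optionally, the first hash table can be updated while it is in use. For example, Key1 is computed from reference data block 1, whose Value is Value1, and Key1 and Value1 are added to the first hash table. Later, Key1 is computed again from another reference data block, reference data block 2, whose Value is Value2. Value1 in the first hash table can then be replaced with Value2, or left unchanged.
If Value1 is replaced with Value2, later lookups in the first hash table will find matching data closer to the first address of the reference block; if Value1 is not replaced, later lookups will find matching data farther from the endpoint. In the embodiments of the present invention, matching the first sub-target block and the target data after it backward against the first reference data and the reference data after it compensates as far as possible for the loss caused by sampling when the first hash table is built.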
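Optionally, before the first coding sequence is generated, the method further includes:
matching the target data located before the first sub-target block in the target block against the other reference data located before the first reference data in the reference block.
That is, in addition to the backward matching mentioned above, the first sub-target block and the target data before it can be forward-matched against the first reference data and the reference data before it until no further match is possible. In this case the first coding sequence is generated from the results of both backward matching and forward matching.
In the embodiments of the present invention, performing forward matching in addition to backward matching allows more data to be matched and reduces the number of later matching operations.
For example, the target block contains the data ABCDEFGHIJKLMNOPQRST and n is 4. A hash operation on HIJK yields Key1, which is found in the first hash table with the corresponding Value Value1. From Value1, the corresponding reference data 1 is located in the reference block. Starting from reference data 1, HIJK and the target data after K are matched consecutively against the reference data in the reference block; this is backward matching. Suppose the target data matched backward is HIJKL, that is, target data M does not match. After reference data 1 is located, HIJK and the target data before H can also be matched consecutively against the reference data, starting from reference data 1; this is forward matching. Suppose the target data matched forward is FG (not counting HIJK), that is, target data E does not match. Combining the backward and forward results gives the final matching result, for example FGHIJKL, from which the coding sequence is generated; this is the second matching result referred to in this application.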
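The combined forward and backward extension in this example can be sketched as follows. The reference block `ZZFGHIJKLYY` is a hypothetical one chosen so that the outcome matches the example (the text does not give the reference block), and `extend_match` is an illustrative name. "Forward" and "backward" follow the document's usage: forward means toward earlier data, backward toward later data.

```python
def extend_match(target, t_pos, source, s_pos):
    """Grow a match around the seed positions in both directions.

    Returns (start_in_target, start_in_source, total_length).
    """
    before = 0  # forward matching: items before the seed
    while (t_pos - before - 1 >= 0 and s_pos - before - 1 >= 0
           and target[t_pos - before - 1] == source[s_pos - before - 1]):
        before += 1
    after = 0  # backward matching: the seed itself and items after it
    while (t_pos + after < len(target) and s_pos + after < len(source)
           and target[t_pos + after] == source[s_pos + after]):
        after += 1
    return t_pos - before, s_pos - before, before + after


target = "ABCDEFGHIJKLMNOPQRST"
reference = "ZZFGHIJKLYY"  # hypothetical reference block for illustration

# Seed: HIJK starts at index 7 of the target and index 4 of the reference.
t_start, r_start, length = extend_match(target, 7, reference, 4)
matched = target[t_start:t_start + length]
```

The two loops are independent, so they can run in either order.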
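Forward matching and backward matching may be performed in either order.
Optionally, the first coding sequence further includes unmatched target data, where the unmatched target data is the target data between the last matched target data of the previous coding sequence and the first target data matched this time.
In other words, the unmatched target data is the target data between the last target data reached by forward matching and the last target data covered by the coding sequence immediately preceding the first coding sequence.
For example, the first coding sequence further includes a Lit field that can store the unmatched data.
Continuing the previous example, the target block contains the target data ABCDEFGHIJKLMNOPQRST and n is 4. A hash operation on HIJK followed by forward and backward matching gives the final matching result FGHIJKL. Coding sequence 1 is generated from FGHIJKL and may include a match length of 7 and an offset equal to the distance between the first matched reference data and the first address of the reference block. Suppose the last data item covered by the preceding coding sequence is D, so that the target data E between ABCD and FGHIJKL is unmatched. Coding sequence 1 then also includes a Lit field that stores the unmatched target data E. During decoding, ABCD is recovered from the preceding coding sequence and EFGHIJKL from this coding sequence.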
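A minimal sketch of how the unmatched gap could be collected for this example. The dictionary layout and the name `make_sequence` are assumptions for illustration only (the on-disk layout of a coding sequence is described separately with FIG. 3), and the offset value 2 is hypothetical.

```python
def make_sequence(target, prev_end, match_start, match_len, offset):
    """Bundle the literal gap and the match description for one coding sequence."""
    return {
        "lit": target[prev_end:match_start],  # unmatched data since the previous sequence
        "match_len": match_len,
        "offset": offset,
    }


target = "ABCDEFGHIJKLMNOPQRST"

# The previous sequence ended after D (index 4); this match covers FGHIJKL (indices 5 to 11).
seq = make_sequence(target, prev_end=4, match_start=5, match_len=7, offset=2)

# What a decoder would emit for this sequence: the literal, then the matched run.
restored = seq["lit"] + target[5:5 + seq["match_len"]]
```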
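Optionally, in addition to the first hash table, a second hash table can be constructed; the second hash table corresponds to the target block. It can be built in the same way as the first hash table, using the same hash algorithm, and the amount of target data fed into each hash operation can also be n.
After a sub-target block is obtained, it can be looked up in the second hash table as well as the first. Four outcomes are possible: 1. the hash value corresponding to the key value of the sub-target block is found in both the first and the second hash table; 2. the hash value is not found in the first hash table but is found in the second; 3. the hash value is found in neither table; 4. the hash value is found in the first hash table but not in the second. These cases are described below.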
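Case 1:
Optionally,
after the hash operation is performed on the first sub-target block to obtain the first key value, the method further includes:
querying the second hash table according to the first key value, where the hash value corresponding to a key value in the second hash table indicates the address of target data in the target block; if a second hash value corresponding to the first key value is found in the second hash table, obtaining first target data according to the second hash value, matching the leading target data of the first sub-target block against the first target data, and matching second target data in the target block against third target data in the target block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the third target data is the other target data located after the first target data in the target block; and matching the target data located before the first sub-target block in the target block against the other target data located before the first target data in the target block, to obtain a first matching result.
The result of matching the leading target data of the first sub-target block against the first reference data and matching the second target data in the target block against the second reference data in the reference block is a second matching result. Generating the first coding sequence includes:
selecting, from the first matching result and the second matching result, the one in which more target data is matched, and generating the first coding sequence according to the selected matching result. The first coding sequence further includes an indication bit that, when more target data is matched in the first matching result than in the second, indicates that the data matching the target data matched this time is located in the target block, or, when more target data is matched in the second matching result than in the first, indicates that the data is located in the reference block.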
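Like the first hash table, the second hash table can use a Key-Value form. Performing the hash operation on the data in the first sub-target block yields a Key; in the second hash table Keys and Values correspond one to one, so the Value can be looked up from the Key obtained. The Value can indicate, for example, the distance between the first address of the data in the sub-target block corresponding to that Value and the first address of a specific target block. This specific target block may be the first of all the target blocks, or another target block.
The second hash table can contain both Keys and Values in one-to-one correspondence. The lookup can proceed as follows: a hash operation on the target data in the first sub-target block yields a Key (for example called the first Key), and the second hash table is searched for an identical Key. If an identical Key (for example called the second Key) is found, it corresponds to a Value (for example called the first Value), and the first Value can be regarded as the hash value corresponding to the first Key.
In this embodiment, after the corresponding hash values are looked up in the first and second hash tables according to the result of the operation and are found in both, matching can be performed through each table.
That is, the first hash value corresponding to the first key value is found in the first hash table, and the first reference data in the reference block is determined from it, which amounts to determining a potential position. The first sub-target block and the target data after it are then matched backward against the first reference data and the reference data after it, to compensate for the matches missed because of sampling when the first hash table was built, until no further match is possible; the result is called the first matching result. Alternatively, in addition to backward matching, the first sub-target block and the target data before it can be forward-matched against the first reference data and the reference data before it, until nothing more matches in either direction. The result can then be regarded as the maximum match at the current matching point, and can also be called the first matching result.
Similarly, the second hash value corresponding to the first key value is found in the second hash table, and the corresponding third target data in the target block is determined from it. The first sub-target block and the target data after it are then matched backward against the third target data and the target data after it until no further match is possible; the result is called the second matching result. Alternatively, in addition to backward matching, the first sub-target block and the target data before it can be forward-matched against the third target data and the target data before it, until nothing more matches in either direction. The result can then be regarded as the maximum match at the current matching point, and can be called the second matching result. Matching through the second hash table is similar to matching through the first hash table as illustrated earlier, so no further examples are given here.
After the two matching results are obtained, the one with more bytes (that is, more data) is selected as the final matching result for the first key value, and the first coding sequence is generated from it. In this case the first coding sequence can also include an indication bit showing whether the data matching the target data of the first coding sequence is located in the reference block or the target block, that is, whether the selected matching result was obtained through the first hash table or the second.
If the indication bit shows that the matching data is in the reference block, the offset in the first coding sequence indicates the distance between the address of the first reference data matched this time in the reference block (which may be the first reference data or data before it) and the first address of the reference block. If the indication bit shows that the matching data is in the target block, the offset indicates the distance between the address of the first target data matched this time in the target block (which may be the first target data or data before it) and the first address of the specific target block.
In this case the same hash value may appear in both the first and the second hash table, so the lookup and matching may yield two matching results. The result covering more data can then be chosen for encoding, which improves matching accuracy and the compression ratio.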
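Optionally, the method further includes:
updating the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
Different data hashed with the same hash algorithm may produce the same result, that is, the same Key. And if the target block contains repeated data segments, hashing them with the same algorithm will generally produce the same Key.
For example, target block 1 contains two sub-target blocks, sub-target block 1 and sub-target block 2, both containing the target data ABCD; they are duplicate data blocks within target block 1. Sub-target block 1 is hashed first, giving Key1 with Value Value1, and Key1 and Value1 are stored in the second hash table. Sub-target block 2 is hashed later and also gives Key1, but the two sub-target blocks are at different positions in the target block, so their Values differ; say the Value for sub-target block 2 is Value2. Value1 in the second hash table can then be replaced with Value2.
That is, after the second hash table is queried, the hash value stored for the first key value in the second hash table can be updated with the hash value obtained for the first key value this time. Updating here means replacing the old hash value in the second hash table with the new one, which keeps the second hash table current and raises the success rate of the next match.
As described above, the second hash table may or may not be updated. For example, if the second hash table is not updated, the same hash value appears at most once in the first and second hash tables, which gives better time performance, that is, a shorter compression time.
Theory and experiment both confirm that the final compression ratio and performance differ little whether or not the second hash table is updated, so whether to update it can be decided according to the actual situation.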
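Case 2:
Optionally, if no hash value corresponding to the first key value is found in the first hash table, before the first coding sequence is generated, the method further includes:
querying the second hash table according to the first key value;
if a hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data, obtaining first target data according to the hash value corresponding to the first key value in the second hash table, matching the leading target data of the first sub-target block against the first target data, and matching second target data in the target block against third target data in the target block, where the second target data is the other target data located after the leading target data of the first sub-target block, and the third target data is the other target data located after the first target data in the target block; and
the first coding sequence further includes an indication bit indicating that the data matching the target data matched this time is located in the target block.
That is, after the corresponding hash values are looked up in the first and second hash tables according to the first key value, a hash value is found in the second hash table but not in the first, so matching is performed through the second hash table.
For example, the hash value corresponding to the result of the operation (for example called the second hash value) is found in the second hash table, and the corresponding first target data in the target block is determined from it, which amounts to determining a potential position. The first sub-target block and the target data after it are then matched backward against the first target data and the target data after it, and the first coding sequence can be generated from the result. Alternatively, in addition to backward matching, the first sub-target block and the target data before it can be forward-matched against the first target data and the target data before it, until nothing more matches in either direction. The result can then be regarded as the maximum match at the current matching point, and the first coding sequence can be generated from it.
Optionally, the method further includes:
updating the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
For this embodiment, refer to the description in Case 1: the second hash table may or may not be updated. For example, if the second hash table is not updated, the same hash value appears at most once in the first and second hash tables, which gives better time performance, that is, a shorter compression time.
Theory and experiment both confirm that the final compression ratio and performance differ little whether or not the second hash table is updated, so whether to update it can be decided according to the actual situation.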
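Case 3:
Optionally,
after the first hash table is queried according to the first key value, the method further includes:
if no hash value corresponding to the first key value is found in the first hash table, querying the second hash table according to the first key value;
and generating the first coding sequence according to the obtained matching result includes:
if no hash value corresponding to the first key value is found in the second hash table, updating the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
That is, after the corresponding hash values are looked up in the first and second hash tables according to the first key value, no hash value is found in either table, so no matching can be performed.
In this case the hash value corresponding to the first key value can be inserted at the corresponding position in the second hash table, that is, the second hash table is updated with it, so that the next time the first key value occurs the corresponding hash value can be found in the second hash table. The second hash table may also be left unchanged; whether to update it can be decided according to the actual situation.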
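Case 4:
Optionally,
after the hash operation is performed on the first sub-target block to obtain the first key value, the method further includes:
querying the second hash table according to the first key value; and
if no hash value corresponding to the first key value is found in the second hash table, updating the second hash table so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.
That is, after the corresponding hash values are looked up in the first and second hash tables according to the first key value, a hash value is found in the first hash table but not in the second. Matching is then performed through the first hash table, and the first coding sequence is generated from the result; for the matching process and the generation of the first coding sequence, refer to the earlier description.
In this case the hash value corresponding to the first key value can be inserted at the corresponding position in the second hash table, that is, the second hash table is updated with it, so that the next time the first key value occurs the corresponding hash value can be found in the second hash table. The second hash table may also be left unchanged; whether to update it can be decided according to the actual situation.
In practice, suppose there is one 8 Kbit reference block and one 8 Kbit target block to be encoded. The first hash table is first generated from the reference block as described earlier in this embodiment, and encoding of the target block then begins. Take the target block ABCDEFGHIJKABCD with step 1 and n 4 as an example. The sub-target block ABCD is hashed first and its key value is looked up in the first hash table. If the same key value is found there, a coding sequence can be output as described earlier in this embodiment. If it is not found, the key value of ABCD can be stored in the second hash table, with the corresponding hash value being the address of data A. Next, the sub-target block BCDE is obtained and hashed, its key value is looked up in the first and second hash tables, and a coding sequence is output according to the result; the handling follows the four cases of querying the two hash tables described above. When processing reaches the twelfth data item of the target block, the sub-target block ABCD appears again. Its key value still cannot be found in the first hash table, but because the key value and hash value of ABCD have already been stored in the second hash table, the corresponding hash value is found there when ABCD appears again. The second hash table thus allows sub-target blocks processed later in the target block to be matched against sub-target blocks processed earlier, rather than only against reference data in the reference block, which further improves encoding efficiency and the matching success rate.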
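The walk-through above can be traced with a simplified sketch. It assumes the first hash table never hits (passed in as an empty dictionary), uses each 4-item block as its own Key in place of the hash operation, and does not skip ahead after a match as a real encoder would; `find_self_matches` is a hypothetical name.

```python
def find_self_matches(target, n, first_table):
    """Slide over the target block with step 1 and record repeats via a second table.

    Returns (matches, second_table); each match is (current_pos, earlier_pos).
    """
    second_table = {}
    matches = []
    for pos in range(len(target) - n + 1):
        key = target[pos:pos + n]  # stand-in for the hash Key of the sub-target block
        if key in first_table:
            continue  # would be matched against the reference block instead
        if key in second_table:
            matches.append((pos, second_table[key]))
        else:
            second_table[key] = pos  # remember where this block first appeared
    return matches, second_table


matches, second_table = find_self_matches("ABCDEFGHIJKABCD", 4, first_table={})
```

The only repeat is ABCD at index 11 (the twelfth data item), which points back to index 0.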
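In the embodiments of the present invention, once a matching result is obtained, a coding sequence can be generated from it and output, which completes the encoding process.
FIG. 3 is a schematic diagram of one possible coding sequence provided in the embodiments of the present invention. In FIG. 3, Head is the sequence header, the starting node of the whole coding sequence. LLen is the character length; it is an optional field that supplements Head when the character length stored in Head is full. Lit holds characters; it is an optional field that stores the characters of the target block that were not matched, that is, the unmatched target data. Off is the offset; it is always present and can contain the indication bit and the offset described above. MLen is the match length; it is an optional field that supplements Head when the match length stored in Head is full.
Head may, for example, have a fixed length such as 1 Byte. The high h1 bits of Head can store the character length. For example, with h1 equal to 4 (16 possible values), the high h1 bits are used as follows:
0: none;
1 to 14: the character length;
15: the character length is 15 + LLen.
The character length is the length of the target data left unmatched between the end of the previous match and the start of this match. 15 + LLen means that when the character length reaches 15, new space is allocated to hold the excess, and the part of the character length beyond 15 is stored in LLen.
The low h2 bits of Head can store the match length. For example, with h2 equal to 4 (16 possible values), the low h2 bits are used as follows:
0: none;
1 to 14: the match length;
15: the match length is 15 + MLen.
The match length is the length of the bytes matched consecutively this time, that is, the length of the data matched successfully this time. 15 + MLen means that when the match length reaches 15, new space is allocated to hold the excess, and the part of the match length beyond 15 is stored in MLen.
The values of h1 and h2 here are only examples and can be allocated as required.
Because the space used for the character length and the match length is variable and can grow with the actual lengths, a short character length, for instance, occupies little of Head and improves the compression ratio, while a long character length can still be accommodated, so different needs can be met.
LLen stores the character length and may have a variable length, for example in the range [0, n] Bytes. For example, when the character length stored in Head reaches (2^h1 - 1), 1 Byte is allocated as LLen to store the character length; when the value stored in that Byte reaches 255, another Byte is allocated as LLen, and so on until the value stored in the last allocated LLen Byte is less than 255.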
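The length scheme described for Head and LLen (and, symmetrically, MLen) can be sketched as below. The sketch assumes a 4-bit field in Head and that each extension Byte stores the part of the length not yet accounted for, with 255 meaning "another Byte follows"; the names `encode_length` and `decode_length` are illustrative.

```python
def encode_length(length, field_bits=4):
    """Split a length into the Head field value and the extension Bytes.

    Returns (field_value, extension_bytes).
    """
    cap = (1 << field_bits) - 1  # 15 for a 4-bit field
    if length < cap:
        return length, b""
    rest = length - cap
    ext = bytearray()
    while rest >= 255:
        ext.append(255)  # a full Byte signals that another Byte follows
        rest -= 255
    ext.append(rest)     # the final Byte is always below 255
    return cap, bytes(ext)


def decode_length(field_value, ext):
    """Inverse of encode_length: the length is the field value plus all extension Bytes."""
    return field_value + sum(ext)


short = encode_length(6)
exact = encode_length(15)
long_ = encode_length(300)
```

A length of exactly 15 therefore costs one extension Byte holding 0, which is what lets a decoder tell 15 apart from longer lengths.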
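Lit may have a variable length, for example in the range [0, n] Bytes. Lit stores the original data of the target block, that is, the target data that was not matched. For example, if the characters QW in the target block are unmatched, QW can be stored in Lit. The length of Lit can be, for example, the sum of the values stored in the high h1 bits of Head and in LLen.
Off has, for example, a fixed length of 2 Bytes. The highest bit of Off can store the type of the matched block, that is, whether the data segment matched with a target block is a data segment in the reference block or in the target block, for example:
0: target block;
1: reference block.
The low 15 bits of Off can store the offset, for example:
if the matched data block is in the target block, for example when the first sub-target block is matched and the data block matching it is located in the target block, the offset indicates the position in the target block of the target data that matches the target data matched this time;
if the matched data segment is in the reference block, the offset indicates the position in the reference block of the reference data that matches the target data matched this time.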
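The Off layout (one type bit plus a 15-bit offset in 2 Bytes) can be packed and unpacked as in the sketch below. The byte order is not stated in the text; big-endian is assumed here purely for illustration, and the function names are hypothetical.

```python
import struct

OFFSET_BITS = 15


def pack_off(in_reference, offset):
    """Pack the indication bit (1 = reference block, 0 = target block) and the offset."""
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 15 bits")
    value = (int(in_reference) << OFFSET_BITS) | offset
    return struct.pack(">H", value)  # assumed big-endian 16-bit layout


def unpack_off(raw):
    """Return (in_reference, offset) from a 2-Byte Off field."""
    (value,) = struct.unpack(">H", raw)
    return bool(value >> OFFSET_BITS), value & ((1 << OFFSET_BITS) - 1)


off_ref = pack_off(True, 8)        # match located in the reference block at offset 8
off_tgt = pack_off(False, 32767)   # match located in the target block at the largest offset
```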
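The indication in Off makes clear whether the matched data is target data or reference data, so that compression and restoration of the original data can be performed accurately. This effectively supports the two-hash-table matching scheme described above and performs well.
MLen may have a variable length, for example in the range [0, n] Bytes. For example, when the match length stored in Head reaches (2^h2 - 1), 1 Byte can be allocated as MLen to store the match length; when the value stored in that Byte reaches 255, another Byte is allocated as MLen, and so on until the value stored in the last allocated MLen Byte is less than 255.
The coding sequence provided in the embodiments of the present invention can be applied to encoding (for example Delta encoding) and similar fields, and offers a high compression ratio and good performance.
The encoding process has been described above. The following describes how the first coding sequence is decoded, to complete the overall flow.
For example, to decode the first coding sequence, the indication bit in its Off field shows whether the matched data is in the reference block or the target block. For example, if the indication bit in Off is 1, and 1 denotes the reference block, the data matched by the first coding sequence is located in the reference block. The offset in Off gives the position in the reference block of the first reference data matched by the first coding sequence; for example, the reference block contains the reference data ABCDEFGHIJKLMNOPQRST and the offset in Off points to reference data I. The Head of the first coding sequence indicates that the matched data has length 6 and the unmatched data has length 2. The matched data, IJKLMN, can therefore be read from the reference block, and the unmatched data, for example WK, is taken from the Lit field of the first coding sequence. The target data corresponding to the first coding sequence is thus WKIJKLMN.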
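The decoding example can be expressed as a short sketch. The function `decode_sequence` and its parameters are illustrative assumptions; it takes the already-parsed fields rather than raw bytes, and for matches inside the target block it reads from the data decoded so far without handling a match that overlaps its own output.

```python
def decode_sequence(lit, in_reference, offset, match_len, reference, decoded_so_far=""):
    """Rebuild the target data for one coding sequence: literals first, then the match."""
    source = reference if in_reference else decoded_so_far
    return lit + source[offset:offset + match_len]


reference = "ABCDEFGHIJKLMNOPQRST"

# Indication bit 1 (reference block), offset pointing at I, match length 6, Lit = WK.
decoded = decode_sequence("WK", True, reference.index("I"), 6, reference)
```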
下面介绍一个较为完整的例子,用于更好地理解本发明各个实施例提供的技术方案。
请参见图4,为本发明实施例提供的编码方法的流程图。
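Step 1: Obtain the first sub-target block. For example, there are 10 target blocks, each 8 Kbit in size, all of which are to be encoded. What is described here is the process of encoding one sub-target block of one of these target blocks; the other target blocks can be processed similarly. The first sub-target block may be, for example, any sub-target block of these 10 target blocks, and includes, for example, n units of target data.
Step 2: Perform a hash operation on the first sub-target block to obtain key value 1, and query the first hash table and the second hash table with key value 1. If hash value 1 corresponding to key value 1 is found in both the first hash table and the second hash table, perform step 3. If it is found in the first hash table but not in the second hash table, perform step 7. If it is not found in the first hash table but is found in the second hash table, perform step 8. If it is found in neither hash table, perform step 9.
Step 3: Match the first sub-target block according to the first hash value found in the first hash table and according to the second hash value found in the second hash table, respectively, to obtain two matching results. During matching, only backward matching (toward subsequent data) may be performed, or both forward matching (toward preceding data) and backward matching may be performed. In addition, the second hash value in the second hash table may be updated according to the hash value corresponding to the first key value. For the matching process, refer to the foregoing description.
Step 4: Select, from the two matching results, the one with the larger amount of matched data, for example matching result 1.
Step 5: Generate and output encoding sequence 1 according to the determined matching result. For the content and format of encoding sequence 1, refer to the related description of the Figure 1 flow. The determined matching result may be matching result 1, or matching result 2, which is introduced below. Perform step 6.
Step 6: Determine whether any sub-target block remains to be matched. If so, perform step 1; if not, end the flow.
Step 7: Match the first sub-target block according to the first hash value found in the first hash table to obtain a matching result, for example matching result 1. This matching result may be obtained by backward matching alone, or by forward matching plus backward matching. In addition, the second hash value in the second hash table may be updated according to the hash value corresponding to the first key value. Perform step 5.
Step 8: Match the first sub-target block according to the second hash value found in the second hash table to obtain a matching result, for example matching result 2. This matching result may be obtained by backward matching alone, or by forward matching plus backward matching. In addition, the second hash value in the second hash table may be updated according to the hash value corresponding to the first key value. Perform step 5.
Step 9: Update the second hash value in the second hash table according to the hash value corresponding to the first key value.
For the matching process, the process of generating an encoding sequence, the information included in an encoding sequence, and other content in the Figure 4 flow that is not described in detail, refer to the related description of the Figure 1 flow.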
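The Figure 4 flow can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the implementation of the embodiment: `zlib.crc32` stands in for the unspecified hash operation, `N = 4` is a hypothetical sub-block size, only backward matching is performed, and each encoding sequence is represented as a plain tuple `(lit, indicator, offset, match_len)` rather than the Head/Lit/Off/MLen byte layout. All function names are hypothetical.

```python
import zlib

N = 4  # units per sub-target block (hypothetical value)


def key_of(data: bytes, pos: int) -> int:
    """Stand-in hash operation over the N units starting at pos."""
    return zlib.crc32(data[pos:pos + N])


def build_first_table(reference: bytes, step: int = 1) -> dict:
    """First hash table: key value -> address of reference data."""
    table = {}
    for pos in range(0, len(reference) - N + 1, step):
        table.setdefault(key_of(reference, pos), pos)
    return table


def match_length(source: bytes, src_pos: int, target: bytes, tgt_pos: int) -> int:
    """Backward matching: count equal units from both positions onward."""
    n = 0
    while (src_pos + n < len(source) and tgt_pos + n < len(target)
           and source[src_pos + n] == target[tgt_pos + n]):
        n += 1
    return n


def encode(target: bytes, reference: bytes) -> list:
    first, second = build_first_table(reference), {}
    sequences, pos, lit_start = [], 0, 0
    while pos + N <= len(target):
        key = key_of(target, pos)                          # step 2
        results = []
        if key in first:                                   # steps 3 and 7
            results.append((match_length(reference, first[key], target, pos),
                            1, first[key]))
        if key in second:                                  # steps 3 and 8
            results.append((match_length(target, second[key], target, pos),
                            0, second[key]))
        second[key] = pos                                  # update second table
        length, flag, offset = max(results, default=(0, 0, 0))  # step 4
        if length >= N:                                    # step 5
            sequences.append((target[lit_start:pos], flag, offset, length))
            pos += length
            lit_start = pos
        else:
            pos += 1                                       # no usable match
    if lit_start < len(target):                            # trailing unmatched data
        sequences.append((target[lit_start:], 0, 0, 0))
    return sequences


def decode(sequences: list, reference: bytes) -> bytes:
    out = bytearray()
    for lit, flag, offset, length in sequences:
        out += lit
        source = reference if flag == 1 else out
        for i in range(length):
            out.append(source[offset + i])
    return bytes(out)
```

When both tables yield matches of equal length, this sketch prefers the reference block; the text only says that the result with the larger amount of matched data is selected and does not state a tie-breaking rule.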
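The following describes the devices in the embodiments of the present invention with reference to the accompanying drawings.
Refer to Figure 5. Based on the same inventive concept and the foregoing embodiments, an embodiment of the present invention provides an encoding apparatus, which may include an obtaining module 501 and a processing module 502. The encoding apparatus may be used to implement the methods described in the Figure 1 flow and the Figure 2 flow.
For example, the obtaining module 501 may be used to obtain the first sub-target block, and all other steps in the Figure 1 flow and the Figure 2 flow may be performed by the processing module 502. Because each step has been described in the foregoing embodiments, details are not repeated here. For the functions performed by the obtaining module 501 and the processing module 502 and the corresponding execution processes, refer to the description of the foregoing embodiments.
Refer to Figure 6. Based on the same inventive concept and the foregoing embodiments, an embodiment of the present invention provides an encoding apparatus, which may include a memory 601 and a processor 602. This encoding apparatus and the encoding apparatus in Figure 5 may be the same device.
The processor 602 may be, for example, a CPU (central processing unit) or an ASIC (Application Specific Integrated Circuit), may be one or more integrated circuits for controlling program execution, may be a hardware circuit developed using an FPGA (Field Programmable Gate Array), or may be a baseband chip. There may be one or more memories 601. The memory 601 may include a ROM (Read Only Memory), a RAM (Random Access Memory), and a magnetic disk memory.
The memory 601 may be connected to the processor 602 through a bus (Figure 6 uses this as an example), or through a dedicated connection line.
By designing and programming the processor 602, the code corresponding to the foregoing methods is burned into the chip, so that the chip can perform the methods shown in the foregoing embodiments when running. How to design and program the processor 602 is a technique well known to a person skilled in the art, and is not described here.
For example, the memory 601 may store the instructions that the processor 602 needs to perform tasks. By executing the instructions stored in the memory 601, the processor 602 can implement the obtaining module 501 and the processing module 502 of Figure 5; that is, the steps in the Figure 1 flow and the Figure 2 flow may all be performed by the processor 602. Because each step has been described in the foregoing embodiments, details are not repeated here. For the functions performed by the processor 602 and the corresponding execution processes, refer to the description of the foregoing embodiments.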
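In this embodiment of the present invention, when a sub-target block of the target block is obtained (for example, called the first sub-target block), a hash operation is first performed on it, the corresponding hash value is then looked up in the first hash table according to the result of the operation, and the corresponding position in the reference block, that is, the first reference data, is found according to the hash value. The first sub-target block can then be matched backward starting from that position. Determining an approximate position in advance narrows the range to be matched, which greatly reduces the system's workload, saves data compression time, and improves both data compression efficiency and system performance.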
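A person skilled in the art can clearly understand that, for convenience and brevity of description, the foregoing division into functional units is used only as an example. In actual applications, the foregoing functions may be allocated to different functional units as required; that is, the internal structure of the apparatus is divided into different functional units to complete all or some of the functions described above. For the specific working processes of the system, apparatus, and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The foregoing embodiments are merely intended to describe the technical solutions of this application in detail. The description of the foregoing embodiments is only intended to help understand the method of the present invention and its core idea, and should not be construed as limiting the present invention. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention.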

Claims (21)

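  1. An encoding method, characterized by comprising:
    obtaining a first sub-target block, where the first sub-target block belongs to a target block;
    performing a hash operation on the first sub-target block to obtain a first key value, and querying a first hash table according to the first key value, where the hash value corresponding to a key value of the first hash table indicates the address of reference data in a reference block; if a first hash value corresponding to the first key value is found in the first hash table, obtaining, according to the first hash value, the first reference data at the address indicated by the first hash value, matching the leading target data of the first sub-target block against the first reference data, and matching second target data in the target block against second reference data in the reference block, where the second target data is other target data located after the leading target data of the first sub-target block, and the second reference data is other reference data in the reference block located after the first reference data; and
    generating a first encoding sequence according to the matching results of the leading target data and the second target data, where the first encoding sequence includes a match length and an offset, the match length indicates the length of the target data successfully matched this time, and the offset indicates the position of the data that matches the target data successfully matched this time.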
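  2. The method according to claim 1, characterized in that, before the querying a first hash table according to the first key value, the method further comprises:
    obtaining reference data blocks from the reference block according to a first step size, where each reference data block includes n units of reference data, the first sub-target block includes n units of target data, and n is a positive integer; and
    constructing the first hash table, where the key values of the first hash table are obtained from the reference data blocks through the hash operation.
  3. The method according to claim 1 or 2, characterized in that, before the generating a first encoding sequence, the method further comprises: matching the target data in the target block located before the first sub-target block against other reference data in the reference block located before the first reference data.
  4. The method according to claim 3, characterized in that the first encoding sequence further includes unmatched target data, where the unmatched target data is the target data located between the last successfully matched target data corresponding to the previous encoding sequence and the first target data successfully matched this time.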
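  5. The method according to any one of claims 1 to 4, characterized in that
    after the performing a hash operation on the first sub-target block to obtain a first key value, the method further comprises:
    querying a second hash table according to the first key value; if a second hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data in the target block, obtaining first target data according to the second hash value, matching the leading target data of the first sub-target block against the first target data, and matching the second target data in the target block against third target data in the target block, where the second target data is other target data located after the leading target data of the first sub-target block, and the third target data is other target data in the target block located after the first target data, and matching the target data in the target block located before the first sub-target block against other target data in the target block located before the first target data, to obtain a first matching result;
    the matching result of matching the leading target data of the first sub-target block against the first reference data and matching the second target data in the target block against the second reference data in the reference block is a second matching result; and
    the generating a first encoding sequence comprises: selecting, from the first matching result and the second matching result, the matching result with the larger amount of matched target data; and generating the first encoding sequence according to the selected matching result, where the first encoding sequence further includes an indicator bit that, when the amount of matched target data in the first matching result is greater than that in the second matching result, indicates that the data matching the target data successfully matched this time is located in the target block, or, when the amount of matched target data in the second matching result is greater than that in the first matching result, indicates that the data matching the target data successfully matched this time is located in the reference block.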
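  6. The method according to any one of claims 1 to 4, characterized in that, if no hash value corresponding to the first key value is found in the first hash table, before the generating a first encoding sequence, the method further comprises:
    querying a second hash table according to the first key value; and
    if a hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data, obtaining first target data according to the hash value corresponding to the first key value in the second hash table, matching the leading target data of the first sub-target block against the first target data, and matching second target data in the target block against third target data in the target block, where the second target data is other target data located after the leading target data of the first sub-target block, and the third target data is other target data in the target block located after the first target data;
    where the first encoding sequence further includes an indicator bit, which indicates that the data matching the target data successfully matched this time is located in the target block.
  7. The method according to claim 5 or 6, characterized by further comprising:
    updating the second hash table, so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.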
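  8. An encoding apparatus, characterized by comprising:
    an obtaining module, configured to obtain a first sub-target block, where the first sub-target block belongs to a target block; and
    a processing module, configured to: perform a hash operation on the first sub-target block to obtain a first key value, and query a first hash table according to the first key value, where the hash value corresponding to a key value of the first hash table indicates the address of reference data in a reference block; if a first hash value corresponding to the first key value is found in the first hash table, obtain, according to the first hash value, the first reference data at the address indicated by the first hash value, match the leading target data of the first sub-target block against the first reference data, and match second target data in the target block against second reference data in the reference block, where the second target data is other target data located after the leading target data of the first sub-target block, and the second reference data is other reference data in the reference block located after the first reference data; and
    generate a first encoding sequence according to the matching results of the leading target data and the second target data, where the first encoding sequence includes a match length and an offset, the match length indicates the length of the target data successfully matched this time, and the offset indicates the position of the data that matches the target data successfully matched this time.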
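  9. The apparatus according to claim 8, characterized in that the processing module is further configured to:
    before the querying a first hash table according to the first key value, obtain reference data blocks from the reference block according to a first step size, where each reference data block includes n units of reference data, the first sub-target block includes n units of target data, and n is a positive integer; and
    construct the first hash table, where the key values of the first hash table are obtained from the reference data blocks through the hash operation.
  10. The apparatus according to claim 8 or 9, characterized in that the processing module is further configured to:
    before the generating a first encoding sequence, match the target data in the target block located before the first sub-target block against other reference data in the reference block located before the first reference data.
  11. The apparatus according to claim 10, characterized in that the first encoding sequence further includes unmatched target data, where the unmatched target data is the target data located between the last successfully matched target data corresponding to the previous encoding sequence and the first target data successfully matched this time.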
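  12. The apparatus according to any one of claims 8 to 11, characterized in that the processing module is further configured to:
    after the performing a hash operation on the first sub-target block to obtain a first key value, query a second hash table according to the first key value; if a second hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data in the target block, obtain first target data according to the second hash value, match the leading target data of the first sub-target block against the first target data, and match the second target data in the target block against third target data in the target block, where the second target data is other target data located after the leading target data of the first sub-target block, and the third target data is other target data in the target block located after the first target data, and match the target data in the target block located before the first sub-target block against other target data in the target block located before the first target data, to obtain a first matching result;
    where the matching result of matching the leading target data of the first sub-target block against the first reference data and matching the second target data in the target block against the second reference data in the reference block is a second matching result; and
    select, from the first matching result and the second matching result, the matching result with the larger amount of matched target data, and generate the first encoding sequence according to the selected matching result, where the first encoding sequence further includes an indicator bit that, when the amount of matched target data in the first matching result is greater than that in the second matching result, indicates that the data matching the target data successfully matched this time is located in the target block, or, when the amount of matched target data in the second matching result is greater than that in the first matching result, indicates that the data matching the target data successfully matched this time is located in the reference block.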
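  13. The apparatus according to any one of claims 8 to 11, characterized in that the processing module is further configured to:
    if no hash value corresponding to the first key value is found in the first hash table, query a second hash table according to the first key value before the generating a first encoding sequence; and
    if a hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data, obtain first target data according to the hash value corresponding to the first key value in the second hash table, match the leading target data of the first sub-target block against the first target data, and match second target data in the target block against third target data in the target block, where the second target data is other target data located after the leading target data of the first sub-target block, and the third target data is other target data in the target block located after the first target data;
    where the first encoding sequence further includes an indicator bit, which indicates that the data matching the target data successfully matched this time is located in the target block.
  14. The apparatus according to claim 12 or 13, characterized in that the processing module is further configured to:
    update the second hash table, so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.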
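  15. An encoding apparatus, characterized by comprising:
    a memory, configured to store instructions; and
    a processor, configured to, by executing the instructions:
    obtain a first sub-target block, where the first sub-target block belongs to a target block;
    perform a hash operation on the first sub-target block to obtain a first key value, and query a first hash table according to the first key value, where the hash value corresponding to a key value of the first hash table indicates the address of reference data in a reference block; if a first hash value corresponding to the first key value is found in the first hash table, obtain, according to the first hash value, the first reference data at the address indicated by the first hash value, match the leading target data of the first sub-target block against the first reference data, and match second target data in the target block against second reference data in the reference block, where the second target data is other target data located after the leading target data of the first sub-target block, and the second reference data is other reference data in the reference block located after the first reference data; and
    generate a first encoding sequence according to the matching results of the leading target data and the second target data, where the first encoding sequence includes a match length and an offset, the match length indicates the length of the target data successfully matched this time, and the offset indicates the position of the data that matches the target data successfully matched this time.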
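  16. The apparatus according to claim 15, characterized in that the processor is further configured to:
    before the querying a first hash table according to the first key value, obtain reference data blocks from the reference block according to a first step size, where each reference data block includes n units of reference data, the first sub-target block includes n units of target data, and n is a positive integer; and
    construct the first hash table, where the key values of the first hash table are obtained from the reference data blocks through the hash operation.
  17. The apparatus according to claim 15 or 16, characterized in that the processor is further configured to:
    before the generating a first encoding sequence, match the target data in the target block located before the first sub-target block against other reference data in the reference block located before the first reference data.
  18. The apparatus according to claim 17, characterized in that the first encoding sequence further includes unmatched target data, where the unmatched target data is the target data located between the last successfully matched target data corresponding to the previous encoding sequence and the first target data successfully matched this time.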
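  19. The apparatus according to any one of claims 15 to 18, characterized in that the processor is further configured to:
    after the performing a hash operation on the first sub-target block to obtain a first key value, query a second hash table according to the first key value; if a second hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data in the target block, obtain first target data according to the second hash value, match the leading target data of the first sub-target block against the first target data, and match the second target data in the target block against third target data in the target block, where the second target data is other target data located after the leading target data of the first sub-target block, and the third target data is other target data in the target block located after the first target data, and match the target data in the target block located before the first sub-target block against other target data in the target block located before the first target data, to obtain a first matching result;
    where the matching result of matching the leading target data of the first sub-target block against the first reference data and matching the second target data in the target block against the second reference data in the reference block is a second matching result; and
    select, from the first matching result and the second matching result, the matching result with the larger amount of matched target data, and generate the first encoding sequence according to the selected matching result, where the first encoding sequence further includes an indicator bit that, when the amount of matched target data in the first matching result is greater than that in the second matching result, indicates that the data matching the target data successfully matched this time is located in the target block, or, when the amount of matched target data in the second matching result is greater than that in the first matching result, indicates that the data matching the target data successfully matched this time is located in the reference block.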
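  20. The apparatus according to any one of claims 15 to 18, characterized in that the processor is further configured to:
    if no hash value corresponding to the first key value is found in the first hash table, query a second hash table according to the first key value before the generating a first encoding sequence; and
    if a hash value corresponding to the first key value is found in the second hash table, where the hash value corresponding to a key value in the second hash table indicates the address of target data, obtain first target data according to the hash value corresponding to the first key value in the second hash table, match the leading target data of the first sub-target block against the first target data, and match second target data in the target block against third target data in the target block, where the second target data is other target data located after the leading target data of the first sub-target block, and the third target data is other target data in the target block located after the first target data;
    where the first encoding sequence further includes an indicator bit, which indicates that the data matching the target data successfully matched this time is located in the target block.
  21. The apparatus according to claim 19 or 20, characterized in that the processor is further configured to: update the second hash table, so that the hash value corresponding to the first key value in the second hash table indicates the address of the leading target data of the first sub-target block.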
PCT/CN2016/099593 2015-10-31 2016-09-21 Encoding method and apparatus WO2017071431A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16858867.1A EP3361393A4 (en) 2015-10-31 2016-09-21 Encoding method and device
US15/924,007 US10305512B2 (en) 2015-10-31 2018-03-16 Encoding method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510733615.XA CN105426413B (zh) 2015-10-31 2015-10-31 Encoding method and apparatus
CN201510733615.X 2015-10-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/924,007 Continuation US10305512B2 (en) 2015-10-31 2018-03-16 Encoding method and apparatus

Publications (1)

Publication Number Publication Date
WO2017071431A1 (zh) 2017-05-04

Family

ID=55504625

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/099593 WO2017071431A1 (zh) 2015-10-31 2016-09-21 Encoding method and apparatus

Country Status (4)

Country Link
US (1) US10305512B2 (zh)
EP (1) EP3361393A4 (zh)
CN (1) CN105426413B (zh)
WO (1) WO2017071431A1 (zh)

Also Published As

Publication number Publication date
EP3361393A1 (en) 2018-08-15
EP3361393A4 (en) 2018-11-14
CN105426413B (zh) 2018-05-04
CN105426413A (zh) 2016-03-23
US20180205393A1 (en) 2018-07-19
US10305512B2 (en) 2019-05-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16858867; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2016858867; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2016858867; Country of ref document: EP; Effective date: 20180316)