WO2019153700A1 - Codec method, apparatus, and codec device - Google Patents

Codec method, apparatus, and codec device

Info

Publication number
WO2019153700A1
Authority
WO
WIPO (PCT)
Prior art keywords
string
target
source
data
compressed
Prior art date
Application number
PCT/CN2018/100615
Other languages
English (en)
French (fr)
Inventor
李勇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019153700A1

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3086Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing a sliding window, e.g. LZ77
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6064Selection of Compressor

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to a codec method, apparatus, and codec device.
  • Data compression refers to techniques that reduce the amount of data without losing information, so as to reduce storage space and improve the efficiency of transmission, storage, and processing; equivalently, it reorganizes the data according to a certain algorithm to reduce its redundancy and the storage space it occupies.
  • Data compression includes lossy compression and lossless compression. Lossless compression can completely restore the data before compression, and the encoding overhead is smaller than that of lossy compression. It is generally used for compression of desktop text areas.
  • In the field of lossless compression, the Lz77 dictionary encoding algorithm, introduced in 1977, was a milestone. Lz77 encoding is an open source dictionary compression algorithm that belongs to lossless compression.
  • The Lz77 algorithm has been widely used in various data compression fields, and many compression algorithms have been derived from it, but they all still belong to the Lz77 family.
  • Lz77 derivative algorithms such as Lzss, Lzo, Lz4
  • combination algorithms zlib, Lzma, zstd
  • In the prior art, the adopted scheme is to compress and encode the data to be compressed by using a single preset compression algorithm.
  • For example, the codec device is preset to compress and encode the data to be compressed using the Lz4 compression algorithm.
  • Different compression algorithms use different compression coding rules. Therefore, different compression algorithms are used to encode the same data to be compressed, and the required storage space is different, that is, the compression ratio of the obtained compressed data is different.
  • The present application provides a codec method that can ensure that a high compression ratio is obtained for any data to be compressed, so that the occupied storage space is reduced.
  • the application provides a codec method, the method comprising:
  • Obtaining source data to be compressed; determining a source string and description information according to the source data to be compressed, where the source string is a string that is not compressed in the source data, and the description information is used to describe the correspondence between the compressed string and the source string; selecting, from a plurality of preset compression algorithms, a compression algorithm that requires less storage space for encoding the description information as a target algorithm; and using the target algorithm to compress and encode the source string and the description information to obtain compressed data.
  • The source data to be compressed may contain two types of strings: strings that can be compressed, that is, strings that do not appear for the first time in the source data; and strings that cannot be compressed, that is, strings that appear for the first time in the source data and do not need to be compressed.
  • the string that can be compressed in the source data may be a string that is not first seen in the source data and whose length exceeds a threshold, and the threshold may be 3, 4, 5, 6, or the like.
  • the length of a string is the number of characters the string contains.
  • the threshold is the minimum number of characters contained in a compressed string, that is, the minimum length of a compressed string.
  • the string that cannot be compressed in the source data can be a string that appears for the first time and/or a string whose length does not exceed the threshold.
  • a compression algorithm that requires less storage space for encoding the description information is selected as the target algorithm; the target algorithm is used to compress and encode the source data to obtain compressed data;
  • the compression algorithm that requires less storage space is adaptively selected to compress and encode the source data, thereby reducing the storage space required and increasing the compression ratio.
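  • The selection step can be sketched as follows. This is a minimal illustration, not the application's implementation: the field layout `(literal_length, offset, match_length)`, the function name `select_target_algorithm`, and the two toy cost estimators are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical description field: (literal_length, offset, match_length).
Field = Tuple[int, int, int]
Estimator = Callable[[Field], int]


def fixed_offset_cost(field: Field) -> int:
    """Toy estimator: 1 control byte + literals + a fixed 2-byte offset."""
    literal_length, _offset, _match_length = field
    return 1 + literal_length + 2


def variable_offset_cost(field: Field) -> int:
    """Toy estimator: 1 control byte + literals + a 1- or 2-byte offset."""
    literal_length, offset, _match_length = field
    return 1 + literal_length + (1 if offset < 256 else 2)


def select_target_algorithm(
    fields: List[Field], estimators: Dict[str, Estimator]
) -> Tuple[str, Dict[str, int]]:
    """Sum each candidate's estimated bytes and return the cheapest one."""
    totals = {name: sum(est(f) for f in fields) for name, est in estimators.items()}
    target = min(totals, key=totals.get)  # ties go to the first candidate listed
    return target, totals


if __name__ == "__main__":
    candidates = {"fixed": fixed_offset_cost, "variable": variable_offset_cost}
    print(select_target_algorithm([(7, 7, 4)], candidates))
```

  • The storage cost is estimated from the description information alone, so no candidate has to actually encode the data before the choice is made.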
  • the compressed data obtained by compression-coding the source data includes an indication field indicating the target algorithm. That is, the indication field is used to indicate the compression algorithm used to encode the source data.
  • the present invention indicates the compression algorithm used to encode the source data by using the indication field, so that when the compressed data obtained by compression coding the source data is decompressed, the decompression algorithm corresponding to the compression algorithm is used for decompression, and the decompression efficiency is improved.
  • the description information includes a target field, where the target field is used to describe a correspondence between the target string and the source string, and the target string belongs to the compressed string; the target field includes a first value, a second value, and a third value; the first value represents a positional relationship between the target string and the source string; the second value represents a starting position of the target string in the source string; The third value represents the length of the target string.
  • the description information may include one or more (including two) fields, and each field describes a correspondence between a compressed character string and the source string.
  • the source string is a string in the source data that cannot be compression encoded.
  • the source string can be a string that first appears in the source data and/or a string that does not exceed the threshold.
  • the string in the source data that precedes the target string contains the target string, that is, the target string is a non-first occurrence string in the source data.
  • the length of the target string exceeds the threshold.
  • The Lz4 lossless compression algorithm uses an "offset value, matching length" pair, that is, <offset, length>, to replace a string that has appeared previously, and then expresses the pair in the code stream in a specific unambiguous form.
  • the Lz4 lossless compression algorithm is used for compression coding.
  • For example, the source data to be compressed is AAAABCDAAAA and the threshold is 4; the repeated string in the source data is AAAA, and the corresponding offset value and matching length are <7, 4>; the source data can be represented as AAAABCD<7,4>, where the preceding "AAAABCD" is the source string in the source data and <7,4> is the description information; the trailing "AAAA" can be recovered from the description information and the source string.
  • the basic idea of compression coding provided by the present application may be to replace the recurring character string with description information associated with the source string in the source data.
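  • The AAAABCD<7,4> example can be checked with a short sketch. It assumes that the offset counts backwards from the end of the source string; the function name `expand` is hypothetical.

```python
def expand(source_string: bytes, offset: int, length: int) -> bytes:
    """Rebuild the data from a source string plus one <offset, length> pair.

    The offset is assumed to count backwards from the end of the source string.
    """
    out = bytearray(source_string)
    start = len(out) - offset
    if offset <= 0 or start < 0:
        raise ValueError("offset must point inside the source string")
    # Copy byte by byte so that a match may overlap the bytes being written.
    for i in range(length):
        out.append(out[start + i])
    return bytes(out)


if __name__ == "__main__":
    print(expand(b"AAAABCD", 7, 4))
```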
  • the target field is used to describe a correspondence between the target string and the source string, and the target string can be decoded by the source string and the target field.
  • the target field is used to describe the correspondence between the target character string and the source character string, so that the target character string is accurately determined by using the target field and the source string, and the coding efficiency is high.
  • the description information may include description information corresponding to two or more compressed character strings in the source data, and each compressed character string corresponds to one description information. That is, the description information includes at least one description information corresponding to the compressed character string.
  • the compressed data of the first data segment can be quickly generated by using the target field and the source string, which is simple to implement and has high coding efficiency.
  • Storing the target field and the second character string as a piece of information to obtain target information, where the second character string is a character string adjacent to the target character string in the source data and located before the target character string; acquiring the target information; and using the target algorithm to generate compressed data of the second data segment according to the first value, the second value, the third value, and the second string, where the second data segment includes the second string and the target string.
  • the compressed data of the second data segment can be quickly generated by using the target field, which is simple to implement and can save coding time.
  • the method further includes:
  • the present application determines the decompression algorithm used to decompress the compressed data by parsing the indication field of the compressed data, and can complete the decompression operation accurately and quickly.
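  • A minimal sketch of the indication field, under stated assumptions: the field is taken to be a single leading byte, and the standard-library `zlib` and `lzma` codecs stand in for the candidate algorithms (they are not the algorithms discussed in the application). The names `ALGORITHMS`, `pack`, and `unpack` are hypothetical.

```python
import lzma
import zlib

# Hypothetical mapping from indication value to (compress, decompress).
# zlib and lzma are stand-ins chosen only because they ship with Python.
ALGORITHMS = {
    0: (zlib.compress, zlib.decompress),
    1: (lzma.compress, lzma.decompress),
}


def pack(data: bytes, algorithm_id: int) -> bytes:
    """Compress with the chosen algorithm and prepend a 1-byte indication field."""
    compress, _ = ALGORITHMS[algorithm_id]
    return bytes([algorithm_id]) + compress(data)


def unpack(blob: bytes) -> bytes:
    """Parse the indication field, then run the matching decompressor."""
    algorithm_id, payload = blob[0], blob[1:]
    _, decompress = ALGORITHMS[algorithm_id]
    return decompress(payload)


if __name__ == "__main__":
    original = b"AAAABCDAAAA" * 20
    for algorithm_id in ALGORITHMS:
        assert unpack(pack(original, algorithm_id)) == original
```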
  • the target character string is determined to be a string that can be compression-encoded; the first value, the second value, and the third value are obtained, and the target field is generated.
  • the application uses a hash algorithm to search for a string that can be encoded in the source data, and generates corresponding description information, and the time overhead is small.
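  • One way to picture the hash-based search is the sketch below. It is an illustration only: it uses a Python dictionary (a hash table) keyed by fixed-length byte windows, assumes a threshold of 4, and emits hypothetical `(literals, offset, match_length)` tuples rather than the application's target fields.

```python
from typing import Dict, List, Tuple

MIN_MATCH = 4  # assumed threshold: shortest string worth replacing


def find_fields(data: bytes, min_match: int = MIN_MATCH) -> List[Tuple[bytes, int, int]]:
    """Split data into (literals, offset, match_length) tuples.

    A trailing run of unmatched bytes is reported with offset 0 and length 0.
    """
    last_seen: Dict[bytes, int] = {}  # window contents -> most recent position
    fields: List[Tuple[bytes, int, int]] = []
    literal_start = 0
    i = 0
    while i + min_match <= len(data):
        key = data[i:i + min_match]
        candidate = last_seen.get(key)
        last_seen[key] = i
        if candidate is None:
            i += 1
            continue
        # Extend the match as far as the two positions keep agreeing.
        length = min_match
        while i + length < len(data) and data[candidate + length] == data[i + length]:
            length += 1
        fields.append((data[literal_start:i], i - candidate, length))
        i += length
        literal_start = i
    if literal_start < len(data):
        fields.append((data[literal_start:], 0, 0))
    return fields


if __name__ == "__main__":
    print(find_fields(b"AAAABCDAAAA"))
```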
  • the present application provides a codec device, where the codec device includes:
  • An obtaining unit configured to obtain source data to be compressed
  • a determining unit configured to determine a source string and description information according to the source data to be compressed;
  • the source string is a string that is not compressed in the source data, and the description information is used to describe the correspondence between the compressed string and the source string;
  • a calculating unit configured to separately calculate a storage space required for encoding the description information by at least two compression algorithms
  • a selection unit configured to select, as the target algorithm, a compression algorithm that occupies less storage space in the storage space required by the at least two compression algorithms to encode the description information
  • a coding unit configured to compress and encode the source data by using the target algorithm to obtain compressed data.
  • The present application selects, before compressing and encoding the source data, a target algorithm that requires less storage space for encoding the source data, and then compresses and encodes the source data by using the target algorithm; this can noticeably increase the compression ratio and reduce the occupied storage space without significantly increasing the encoding time.
  • the compressed data includes an indication field indicating the target algorithm.
  • the present invention indicates the compression algorithm used to encode the source data by using the indication field, so that when the compressed data obtained by compression coding the source data is decompressed, the decompression algorithm corresponding to the compression algorithm is used for decompression, and the decompression efficiency is improved.
  • the description information includes a target field, where the target field is used to describe a correspondence between the target string and the source string, and the target string belongs to the compressed string; the target field includes a first value, a second value, and a third value; the first value represents a positional relationship between the target string and the source string; the second value represents a starting position of the target string in the source string; and the third value represents the length of the target string.
  • the target field is used to describe the correspondence between the target string and the source string, so that the target string is accurately determined by using the target field and the source string, and the encoding efficiency is high.
  • the codec device further includes:
  • a first storage unit configured to separately store the source string and the description information
  • the coding unit is specifically configured to obtain the target field from the description information, and obtain a first character string from the source string; the first character string is a string adjacent to the target character string in the source data and is located Before the target character string; using the target algorithm to generate compressed data of the first data segment according to the first value, the second value, the third value, and the first string, the first data segment including the first character String and the target string.
  • the compressed data of the first data segment can be quickly generated by using the target field and the source string, which is simple to implement and has high coding efficiency.
  • the codec device further includes:
  • a second storage unit configured to store the target field and the second string as a piece of information, to obtain target information;
  • the second string is a string adjacent to the target string in the source data and located before the target string;
  • the coding unit is specifically configured to acquire the target information, and use the target algorithm to generate compressed data of the second data segment according to the first value, the second value, the third value, and the second string; the second data segment contains the second string and the target string.
  • the compressed data of the second data segment can be quickly generated by using the target field, which is simple to implement and can save coding time.
  • the codec device further includes:
  • a parsing unit configured to parse the compressed data, to obtain the target algorithm indicated by the indication field
  • a decoding unit configured to decompress the compressed data by using a decompression algorithm corresponding to the target algorithm, to obtain the source data.
  • the present application determines the decompression algorithm used to decompress the compressed data by parsing the indication field of the compressed data, and can complete the decompression operation accurately and quickly.
  • the acquiring unit is specifically configured to: sequentially obtain, in the order in which they appear in the source data, the strings that appear for the first time in the source data, to obtain the source string; search, by using a hash algorithm, for a string in the source data that matches the target string; if a reference string matching the target string is found, determine that the target string is a string that can be compression-encoded; and generate the target field.
  • the application uses a hash algorithm to search for a string that can be encoded in the source data, and generates corresponding description information, and the time overhead is small.
  • the present application provides a codec device including a processor and a memory that are connected to each other, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect or any optional implementation of the first aspect.
  • the codec device can compress and encode the source data by selecting, from a plurality of compression algorithms set in advance, a compression algorithm that requires less storage space for encoding the source data, thereby reducing the storage space required and improving the compression ratio.
  • the present application provides a computer readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect or any optional implementation of the first aspect.
  • the compression algorithm that requires less storage space for encoding the source data can be adaptively selected in the compression coding process to compress and encode the source data, thereby reducing the occupied storage space and increasing the compression ratio.
  • the present application may further combine to provide more implementations.
  • FIG. 1 is a schematic diagram showing the structure of a code stream unit used by the Lz4 compression algorithm
  • FIG. 2 is a schematic diagram showing a structure of a code stream unit generated by using an Lz4 compression algorithm
  • FIG. 3 is a schematic diagram showing a structure of a code stream unit generated by a data-shrinker compression algorithm
  • FIG. 4 is a schematic diagram showing a structure of a code stream unit generated by a Lizard compression algorithm
  • FIG. 5 is a schematic diagram of a coding process in the prior art
  • FIG. 6 is a schematic flowchart of a codec method provided by the present application.
  • FIG. 7 is a schematic structural diagram of another code stream unit generated by using an Lz4 compression algorithm
  • FIG. 8 is a schematic structural diagram of still another code stream unit generated by using an Lz4 compression algorithm
  • FIG. 9 is a schematic structural diagram of still another code stream unit generated by using an Lz4 compression algorithm
  • FIG. 10 is a schematic flowchart of another codec method provided by the present application.
  • FIG. 11 is a schematic structural diagram of a codec device provided by the present application.
  • FIG. 12 is a schematic structural diagram of a codec device provided by the present application.
  • Lz4 lossless compression algorithm is a lossless compression algorithm derived from Lz77 compression algorithm.
  • the compression principle of the Lz4 lossless compression algorithm is exactly the same as that of the Lz77 compression algorithm. It uses an "offset value, matching length" pair, that is, <offset, length>, in place of a symbol sequence that has already appeared, and then expresses the pair in a specific unambiguous form.
  • For example, the data to be compressed is AAAABCDAAAA, denoted as AAAABCD<7,4>, where <7,4> replaces the trailing "AAAA" in the data to be compressed.
  • AAAABCD<7,4> can be understood as a representation of the encoded information corresponding to the data to be compressed.
  • To compress and encode the source data to be compressed by using the Lz4 lossless compression algorithm, first obtain the offset value, matching length, character length, and source string; then write the offset value, matching length, character length, and source string to the code stream in a specific form. Writing to the code stream in a specific form means encoding the offset value, the matching length, the character length, and the source string according to the compression coding rule of the Lz4 lossless compression algorithm to obtain the corresponding code stream unit.
  • the encoded code stream units form a code stream.
  • each code stream unit corresponds to compressed data of one data segment in the source data, and these sequentially generated code stream units may be understood as the code stream.
  • the source data to be compressed includes a plurality of data segments, and each code stream unit corresponds to compressed data of one data segment.
  • the process of compressing and encoding the source data to be compressed is a process of compressing and encoding each data segment to obtain a code stream unit corresponding to each data segment.
  • the offset value indicates the starting position of the matching data, that is, the starting position of the compressed character string in the source string.
  • the matching data is the same string in the source string as the compressed string.
  • For example, if the source data is "AAAABCDAAAA", the "AAAA" in the last four bytes of the source data is the compressed string, and the "AAAA" in the first to fourth bytes of the source data is the matching data.
  • the character length indicates the length of the string that cannot be compression-encoded and is adjacent to the compressed character string.
  • the source string represents an unencoded string in the source data.
  • the matching length can be expressed as Match length
  • the character length can be expressed as Literal length.
  • FIG. 1 is a schematic diagram showing the structure of a code stream unit used in the Lz4 compression algorithm.
  • the first field 101 contains all or part of the information of the matching length and the character length, that is, part or all of the information of Literal length and Match length, and occupies one byte;
  • the second field 102 represents the part of the character length that exceeds 15 bytes; if the character length does not exceed 15 bytes, the second field 102 does not exist;
  • the third field 103 is the unencoded string, that is, the source string;
  • the fourth field 104 occupies 2 bytes and is used to record the offset value;
  • the fifth field 105 represents the part of the matching length that exceeds 15 bytes; if the matching length does not exceed 15 bytes, the fifth field 105 does not exist.
  • This first field can be represented as a "Token" control byte. It can be understood that if the character length exceeds 15 bytes, the code stream unit includes a second field 102 indicating that the character length exceeds 15 bytes; otherwise, the code stream unit does not include the second field 102. If the matching length exceeds 15 bytes, the code stream unit includes a fifth field 105 indicating that the matching length exceeds 15 bytes; otherwise, the code stream unit does not include the fifth field 105.
  • If neither the matching length nor the character length exceeds 15 bytes, the first field 101 contains all the information of the matching length and the character length; otherwise, the first field 101 contains only part of that information, and the remaining information of the matching length and the character length is contained in the fifth field 105 and the second field 102, respectively.
  • If the character length is less than 15, that is, Literal length < 15, it is written to the upper 4 bits of the first field, and no further bytes are used to indicate the character length. Otherwise, the part of the character length that exceeds 15 bytes continues to be represented in the second field 102 using a prefix encoding.
  • If the matching length is less than 15, that is, Match length < 15, it is written to the lower 4 bits of the first field, and no further bytes are used to indicate the matching length. Otherwise, the part of the matching length that exceeds 15 bytes continues to be represented in the fifth field 105 using a prefix encoding.
  • the second field 102 and the fifth field 105 adopt the same prefix encoding form: for the part beyond 15, one 0xFF byte is written for each full 255, and a final byte holding the remainder, which is less than 255, ends the writing.
  • the third field 103 inputs the unencoded string as it is.
  • the third field 103 is obtained by inputting the source string in the source data into the stream unit. That is, the third field 103 stores a character string that is not encoded, that is, a source character string.
  • the fourth field 104 is fixed to occupy 2 bytes for recording the offset value.
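  • The five-field layout described above can be sketched as follows. This follows the simplified description in this text and is not a drop-in LZ4 encoder: the published LZ4 block format stores the matching length minus its minimum match of 4, which this sketch does not do. The little-endian offset order and the names `length_extension` and `encode_unit` are assumptions.

```python
def length_extension(value: int, limit: int = 15) -> bytes:
    """Prefix-encode the part of a length at or beyond `limit`.

    Emits one 0xFF byte per full 255, then a final byte below 255.
    """
    if value < limit:
        return b""
    rest = value - limit
    out = bytearray()
    while rest >= 255:
        out.append(0xFF)
        rest -= 255
    out.append(rest)
    return bytes(out)


def encode_unit(literals: bytes, offset: int, match_length: int) -> bytes:
    """Build one code stream unit: token, literal extension, literals,
    2-byte offset, match extension."""
    if not 0 < offset <= 0xFFFF:
        raise ValueError("offset must fit in 2 bytes")
    # Upper 4 bits: character length; lower 4 bits: matching length.
    token = (min(len(literals), 15) << 4) | min(match_length, 15)
    return (
        bytes([token])
        + length_extension(len(literals))
        + literals
        + offset.to_bytes(2, "little")  # byte order is an assumption
        + length_extension(match_length)
    )


if __name__ == "__main__":
    print(encode_unit(b"AAAABCD", 7, 4).hex())
```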
  • the character string "AAAABCDAAAA” is compression-encoded by the Lz4 compression algorithm, and the code stream unit shown in FIG. 2 is obtained.
  • FIG. 2 is a schematic diagram showing the structure of a code stream unit obtained by compression-coding the source data by using the Lz4 compression algorithm.
  • 202 stores the unencoded character string, that is, the source string "AAAABCD";
  • 203 is fixed to occupy two bytes and store the offset value.
  • various forms of data are represented in bits in a code stream unit.
  • the bits corresponding to the respective fields are expressed in hexadecimal.
  • the match length is represented as Match length
  • the character length is represented as Literals length
  • the offset value is represented as Offset.
  • the specific calculation rules are as follows:
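  • If the character length is not less than 15 and the matching length is less than 15: the number of output bytes = $1 + (\lfloor (\text{Literals length} - 15)/255 \rfloor + 1) + \text{Literals length} + 2 + 0$.
  • If the character length is less than 15 and the matching length is not less than 15: the number of output bytes = $1 + 0 + \text{Literals length} + 2 + (\lfloor (\text{Match length} - 15)/255 \rfloor + 1)$.
  • If neither the character length nor the matching length is less than 15: the number of output bytes = $1 + (\lfloor (\text{Literals length} - 15)/255 \rfloor + 1) + \text{Literals length} + 2 + (\lfloor (\text{Match length} - 15)/255 \rfloor + 1)$; where $\lfloor x \rfloor$ refers to the largest integer that does not exceed x.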
  • the byte output model of the Lz4 compression algorithm can accurately and quickly calculate the storage space required to encode the source data using the Lz4 compression algorithm. It can be understood that, before generating the offset value, the matching length, and the code stream unit corresponding to the character length, the byte output model can be used to determine the storage space required to generate the code stream unit.
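  • The byte output model above can be written as a small function. This is a sketch of the formulas as stated in this text; the names `extension_bytes` and `lz4_output_bytes` are hypothetical.

```python
def extension_bytes(length: int, limit: int) -> int:
    """Bytes needed to prefix-encode the part of `length` at or beyond `limit`."""
    if length < limit:
        return 0
    return (length - limit) // 255 + 1


def lz4_output_bytes(literal_length: int, match_length: int) -> int:
    """Estimated size of one code stream unit under the model in the text:
    token + literal extension + literals + 2-byte offset + match extension."""
    return (
        1
        + extension_bytes(literal_length, 15)
        + literal_length
        + 2
        + extension_bytes(match_length, 15)
    )


if __name__ == "__main__":
    # The AAAABCD<7,4> example: 7 literal bytes, matching length 4.
    print(lz4_output_bytes(7, 4))
```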
  • the data-shrinker compression algorithm is another open source lossless compression algorithm similar to Lz4, which is roughly the same as Lz4 in the design of the code stream unit. The difference lies in:
  • In the first field, which contains part or all of the information of the matching length and the character length, the 8 bits of the "Token" control byte are split as "3+1+4" instead of "4+4"; the separated 1 bit indicates whether the fourth field occupies one byte or two bytes, that is, whether Offset is 1 byte or 2 bytes of data; the fourth field is used to record the offset value.
  • the fourth field that is, the field for recording the offset value, is no longer fixed to 2 bytes.
  • If the offset value is less than 256, that is, Offset < 256, 1 byte is used to represent the offset value; otherwise, 2 bytes are used to represent the offset value.
  • the prefix encoding of the character length changes. If the character length is less than 7, that is, Literal length < 7, it is written to the upper 3 bits of the first field, and no further bytes are used to represent the character length; otherwise, the part of the character length that exceeds 7 bytes continues to be represented by the prefix encoding in the second field.
  • the writing order of the matching length and the source string is exchanged, that is, the third field and the fifth field swap positions in the code stream unit.
  • the character string "AAAABCDAAAA” in the example is compression-encoded by the data-shrinker compression algorithm, and the code stream unit shown in FIG. 3 is obtained.
  • 301 occupies one byte: the upper 3 bits indicate the character length 7, that is, the Literal length; the lower 4 bits indicate the matching length 4, that is, the Match length; and the fourth bit is 0, indicating that the offset value occupies one byte.
  • 302 indicates the part of the character length that exceeds 7 bytes, that is, the part of Literal length that exceeds 7 bytes
  • 304 stores the unencoded character string, that is, the source string "AAAABCD".
  • various forms of data are represented by bits in the code stream unit. In order to clearly display the respective fields in FIG. 3, the bits corresponding to the respective fields are expressed in hexadecimal.
  • In this example, the code stream obtained by the Lz4 compression algorithm has the same length as the code stream obtained by the data-shrinker compression algorithm.
  • the data-shrinker compression algorithm has the following disadvantages and advantages over the Lz4 compression algorithm:
  • One bit in the first field is used, that is, one bit in the "Token" control byte is used, causing the data-shrinker compression algorithm to reduce the encoding ability of the character length.
  • the Lz4 compression algorithm only needs 1 byte to represent, but in the data-shrinker compression algorithm, 2 bytes are needed.
  • the encoding capability of the data-shrinker compression algorithm is enhanced on the recording offset value.
  • Compression coding with the data-shrinker compression algorithm avoids the byte waste that occurs in the Lz4 compression algorithm. Comparing FIG. 2 and FIG. 3 shows that the data-shrinker compression algorithm uses 1 byte to record the offset value, whereas the Lz4 compression algorithm uses 2 bytes.
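  • If Literals length < 7, Offset < 256, and Match length < 15: the number of output bytes = $1 + 0 + 1 + 0 + \text{Literals length}$.
  • If Literals length < 7, Offset is not less than 256, and Match length < 15: the number of output bytes = $1 + 0 + 2 + 0 + \text{Literals length}$.
  • If Literals length is not less than 7, Offset < 256, and Match length < 15: the number of output bytes = $1 + (\lfloor (\text{Literals length} - 7)/255 \rfloor + 1) + 1 + 0 + \text{Literals length}$.
  • If Literals length is not less than 7, Offset is not less than 256, and Match length < 15: the number of output bytes = $1 + (\lfloor (\text{Literals length} - 7)/255 \rfloor + 1) + 2 + 0 + \text{Literals length}$.
  • If Literals length < 7, Offset < 256, and Match length is not less than 15: the number of output bytes = $1 + 0 + 1 + (\lfloor (\text{Match length} - 15)/255 \rfloor + 1) + \text{Literals length}$.
  • If Literals length < 7, Offset is not less than 256, and Match length is not less than 15: the number of output bytes = $1 + 0 + 2 + (\lfloor (\text{Match length} - 15)/255 \rfloor + 1) + \text{Literals length}$.
  • If Literals length is not less than 7, Offset < 256, and Match length is not less than 15: the number of output bytes = $1 + (\lfloor (\text{Literals length} - 7)/255 \rfloor + 1) + 1 + (\lfloor (\text{Match length} - 15)/255 \rfloor + 1) + \text{Literals length}$.
  • If Literals length is not less than 7, Offset is not less than 256, and Match length is not less than 15: the number of output bytes = $1 + (\lfloor (\text{Literals length} - 7)/255 \rfloor + 1) + 2 + (\lfloor (\text{Match length} - 15)/255 \rfloor + 1) + \text{Literals length}$.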
  • the byte output model of the data-shrinker compression algorithm can accurately and quickly calculate the storage space required to encode the source data using the data-shrinker compression algorithm. It can be understood that, before generating the code stream unit corresponding to the offset value, the matching length, and the character length, the byte output model of the data-shrinker compression algorithm can be used to determine the storage space required to generate that code stream unit with the data-shrinker compression algorithm.
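  • The data-shrinker byte output model can be sketched in the same way. The function follows the formulas as stated in this text, not the data-shrinker source code; the names `extension_bytes` and `shrinker_output_bytes` are hypothetical.

```python
def extension_bytes(length: int, limit: int) -> int:
    """Bytes needed to prefix-encode the part of `length` at or beyond `limit`."""
    if length < limit:
        return 0
    return (length - limit) // 255 + 1


def shrinker_output_bytes(literal_length: int, offset: int, match_length: int) -> int:
    """Estimated size of one code stream unit under the model in the text:
    token + literal extension + 1- or 2-byte offset + match extension + literals."""
    offset_bytes = 1 if offset < 256 else 2
    return (
        1
        + extension_bytes(literal_length, 7)
        + offset_bytes
        + extension_bytes(match_length, 15)
        + literal_length
    )


if __name__ == "__main__":
    # The AAAABCD<7,4> example: one extra byte for the character length,
    # one byte saved on the offset.
    print(shrinker_output_bytes(7, 7, 4))
```

  • For the AAAABCD<7,4> example this model gives 10 bytes: the extra byte spent on the character length cancels the byte saved on the offset.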
  • the Lizard compression algorithm (formerly known as "Lz5") is another open source lossless compression algorithm similar to the Lz4 compression algorithm. Compared with the Lz4 compression algorithm, the Lizard compression algorithm has a feature similar to "entropy coding" in designing the first field, the "Token" control byte. Entropy coding is the encoding that does not lose any information according to the entropy principle in the encoding process.
  • the Lizard compression algorithm and the Lz4 compression algorithm mainly have the following differences:
  • the first is 1OO LL MMM:
  • the offset value is between 0-1023, the highest bit is 1; the second and third bits are connected with a byte occupied by the later record offset value for a total of 10 bits, and the offset value is recorded together;
  • the 2 bits occupied by the fourth and fifth bits represent the prefix encoding of the character length, that is, the prefix encoding of Literal length;
  • the 3 bits occupied by the sixth to eighth bits represent the prefix encoding of the matching length, that is, the prefix encoding of Match length.
  • the "Token" control byte occupies one byte and one byte contains 8 bits.
  • the second type is 00LLL MMM:
  • the offset value is between 1024 and 65535, and the highest 2 bits are 00; the offset value is recorded in 2 bytes during encoding; the 3 bits occupied by the third to fifth bits represent the prefix encoding of the character length, that is, the prefix encoding of Literal length; the 3 bits occupied by the sixth to eighth bits represent the prefix encoding of the matching length, that is, the prefix encoding of Match length.
  • the third is 01LLL MMM:
  • the offset value is between 65536 and 16777215, and the highest 2 bits are 01; the offset value is recorded in 3 bytes during encoding; the 3 bits occupied by the third to fifth bits represent the prefix encoding of the character length, that is, the prefix encoding of Literal length; the 3 bits occupied by the sixth to eighth bits represent the prefix encoding of the matching length, that is, the prefix encoding of Match length.
  • The offset value is no longer fixed at 2 bytes; it occupies 1 to 3 bytes, as marked by the associated flag bits in the first field, that is, the "Token" control byte.
  • When the "Token" control byte is not of the form 1OO LL MMM: if the character length is less than 7, that is, Literal length < 7, it is written directly into the third to fifth bits of the "Token" control byte, and no further byte is used to represent the character length. Otherwise, the part of the character length beyond 7 is represented, after the prefix encoding, in the "Literal length+" part;
  • If the matching length is less than 7, that is, Match length < 7, it is written directly into the sixth to eighth bits of the "Token" control byte, and no further byte is used to represent the matching length. Otherwise, the part of the matching length beyond 7 is represented, after the prefix encoding, in the "Match length+" part; "Match length+" corresponds to the fifth field in FIG. 1.
  • the character string "AAAABCDAAAA” in the example is compression-encoded using the Lizard compression algorithm to obtain a code stream unit as shown in FIG.
  • The first field 401 is "0x9C", and the corresponding 8 bits are "10011100".
  • The highest bit is 1; the 2 bits occupied by the second and third bits are joined with the byte (404) that subsequently records the offset value, for a total of 10 bits that jointly record the offset value; the 2 bits occupied by the fourth and fifth bits represent the prefix encoding of the character length; the 3 bits occupied by the sixth to eighth bits represent the prefix encoding of the matching length.
  • The second field 402 represents the character length; it is "0x04", and the corresponding 8 bits are "00000100". 403 represents the unencoded character string, that is, the source string. 404 occupies one byte and records the offset value.
  • The 10 bits formed by the 2 bits of the second and third bits of the first field 401 and the 8 bits of 404 represent an offset value of 7; the 3 bits occupied by the sixth to eighth bits of the first field 401, that is, "100", represent a matching length of 4; the sum of the 2 bits occupied by the fourth and fifth bits of the first field 401, that is, "11", and the 8 bits of the second field 402, that is, "00000100", represents a character length of 7 (3 + 4).
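  • The field arithmetic of this example can be checked with a few bit operations. The sketch below is illustrative only: the function name is hypothetical, and the value 0x07 for byte 404 is an assumption inferred from the stated offset value of 7 (the byte itself is not written out in the text).

```python
def decode_first_token_form(token: int, literal_ext: int, offset_byte: int):
    """Split a 1OO LL MMM token into (offset, character length, matching length).

    `literal_ext` is the value of the second field (pass 0 when it is absent);
    `offset_byte` is the byte that carries the low 8 bits of the offset.
    """
    if token >> 7 != 1:
        raise ValueError("token is not of the form 1OO LL MMM")
    offset_high = (token >> 5) & 0b11      # second and third bits
    literal_prefix = (token >> 3) & 0b11   # fourth and fifth bits
    match_prefix = token & 0b111           # sixth to eighth bits
    offset = (offset_high << 8) | offset_byte   # 10-bit offset value
    literal_len = literal_prefix + literal_ext  # prefix plus second field
    return offset, literal_len, match_prefix


# Token 0x9C, second field 0x04, offset byte assumed to be 0x07.
print(decode_first_token_form(0x9C, 0x04, 0x07))  # (7, 7, 4)
```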
  • For this example, the code stream obtained with the Lizard compression algorithm has the same length as the code streams obtained with the Lz4 compression algorithm and the data-shrinker compression algorithm.
  • Because the "Token" control byte of the Lizard compression algorithm has the feature of "entropy coding", its code stream is more complicated than those of the Lz4 compression algorithm and the data-shrinker compression algorithm.
  • Consequently, the Lizard compression algorithm is slower than the Lz4 compression algorithm and the data-shrinker compression algorithm, but its compression ratio is higher. In general, because short matches occur often, the offset value, the matching length, and the character length are all small, so the Lizard compression algorithm needs less storage space to encode the source data to be compressed.
  • Number of output bytes: 1 + 0 + 1 + 0 + Literals length.
  • Number of output bytes: 1 + (⌊(Literals length - 3)/255⌋ + 1) + 1 + 0 + Literals length.
  • Number of output bytes: 1 + 0 + 1 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
  • Number of output bytes: 1 + (⌊(Literals length - 3)/255⌋ + 1) + 1 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
  • Number of output bytes: 1 + 0 + 2 + 0 + Literals length.
  • Number of output bytes: 1 + (⌊(Literals length - 7)/255⌋ + 1) + 2 + 0 + Literals length.
  • Number of output bytes: 1 + 0 + 2 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
  • Number of output bytes: 1 + (⌊(Literals length - 7)/255⌋ + 1) + 2 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
  • Number of output bytes: 1 + 0 + 3 + 0 + Literals length.
  • Number of output bytes: 1 + (⌊(Literals length - 7)/255⌋ + 1) + 3 + 0 + Literals length.
  • Number of output bytes: 1 + 0 + 3 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
  • Number of output bytes: 1 + (⌊(Literals length - 7)/255⌋ + 1) + 3 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
  • The byte output model of the Lizard compression algorithm makes it possible to calculate, accurately and quickly, the storage space required to encode the source data with the Lizard compression algorithm. In other words, before the code stream unit corresponding to the offset value, the matching length, and the character length is generated, the byte output model can be used to calculate the storage space that the code stream unit generated by the Lizard compression algorithm will occupy.
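  • As a hedged sketch, the twelve Lizard cases can be written as one function keyed on the offset range. It assumes that the offset ranges given for the three token forms (0-1023, 1024-65535, 65536-16777215) select 1, 2, or 3 offset bytes, that the character-length limit is 3 for the first form and 7 otherwise, and that lengths below the limit need no extension bytes. All function names are hypothetical.

```python
def extra_length_bytes(length: int, threshold: int) -> int:
    """Extension bytes for a length that does not fit in the token field."""
    if length < threshold:
        return 0
    return (length - threshold) // 255 + 1


def lizard_offset_bytes(offset: int) -> int:
    """Number of bytes that record the offset, by token form."""
    if 0 <= offset <= 1023:
        return 1          # form 1OO LL MMM
    if offset <= 65535:
        return 2          # form 00LLL MMM
    if offset <= 16777215:
        return 3          # form 01LLL MMM
    raise ValueError("offset outside the three described ranges")


def lizard_output_bytes(literal_len: int, match_len: int, offset: int) -> int:
    """Token + character-length extension + offset + matching-length
    extension + literals, following the listed formulas."""
    offset_bytes = lizard_offset_bytes(offset)
    literal_threshold = 3 if offset_bytes == 1 else 7
    return (1
            + extra_length_bytes(literal_len, literal_threshold)
            + offset_bytes
            + extra_length_bytes(match_len, 7)
            + literal_len)


# The "AAAABCDAAAA" example: character length 7, matching length 4, offset 7.
print(lizard_output_bytes(7, 4, 7))
```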
  • Calculating, by means of a byte output model, the storage space that a compression algorithm needs to encode the description information is an inventive point of the present application. Conventional compression algorithms do not calculate the storage space required to encode the description information or the source data.
  • The encoding rules, code stream unit structures, and byte output models of the traditional Lz4, data-shrinker, and Lizard compression algorithms have been introduced above.
  • The compression ratios that the traditional Lz4, data-shrinker, and Lizard compression algorithms achieve on different types of data to be compressed are not fixed. Different compression algorithms may occupy different amounts of storage space when encoding the same source data, and no single compression algorithm can guarantee a good compression ratio for all source data. Therefore, before the source data is compressed and encoded, the byte output model corresponding to each available compression algorithm can be used to calculate the storage space required to encode the source data.
  • The source string and the semantic information of the source data may be acquired in the process of traversing the source data.
  • The semantic information is the encoded information corresponding to the compressed string, for example, the offset value and the matching length in the Lz4 lossless compression algorithm.
  • A compression algorithm of the Lz77 family computes <offset value, matching length> pairs while traversing the source data to generate the semantic information, and then encodes the semantic information according to the specified encoding rules to obtain the code stream units; the code stream units together form the code stream.
  • The compression coding methods currently in use compress and encode the source data with a single preset compression algorithm, and the storage space required to encode the source data cannot be determined before the compression coding is complete.
  • Moreover, no single compression algorithm can guarantee a good compression ratio for every type of source data, so encoding certain types of source data occupies a large storage space.
  • To address this, the present application provides a codec method whose main principle includes: determining a source string and description information according to the source data to be compressed in the process of traversing the source data, and storing the source string and the description information; calculating, before encoding the source data, the storage space required by each of at least two preset compression algorithms to encode the description information; and selecting the compression algorithm that requires less storage space to perform compression coding according to the stored source string and description information, thereby obtaining the compressed data.
  • In this way, the present application can compare the storage space required by each compression algorithm and then, according to the service requirement, select the algorithm that occupies less storage space to compress the data to be compressed, which improves the compression ratio of the data to be compressed and saves storage space.
  • An embodiment of the present invention provides a codec method, as shown in FIG. 6, including:
  • The source string is a string that is not compressed in the source data, and the description information is used to describe the correspondence between the compressed character string and the source string.
  • the source data to be compressed may be text data, image data, audio data, or the like.
  • The compressed character string may be a character string that is not appearing in the source data for the first time (that is, it repeats an earlier character string) and whose length exceeds a threshold.
  • the threshold may be 3, 4, 5, 6, or the like.
  • the source string may be a string that appears for the first time in the source data and/or whose length does not exceed the threshold. It can be understood that the compressed character string can be obtained by the above description information and the above source string.
  • The description information replaces the repeated character string, that is, the compressed character string is replaced by the description information, thereby implementing compression coding of the source data.
  • For example, the source data to be compressed is "AAAABCDAAAA" and the threshold is 4; then the source string is "AAAABCD", the compressed string is the second occurrence of "AAAA" in the source data, and the description information is (7, 7, 4).
  • The first value in the description information indicates the positional relationship between the compressed character string and the source string, the second value indicates the starting position of the compressed character string in the source string, and the third value represents the length of the compressed string. That is, the compressed string follows the 7th character of the source string; the position 7 bytes before the start of the compressed string is the starting position of the compressed string in the source string; and the length of the compressed string is 4.
  • As another example, the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4; then the source string is "AAAABCDEF", the compressed strings are the second occurrences of "AAAAB", "BCDA", and "EFBC" in the source data, and the three pieces of description information corresponding to the three compressed strings are (7, 7, 5), (2, 10, 4), and (0, 6, 4).
  • The first 7 characters of the source data appear for the first time, cannot be compression-encoded, and belong to the source string; the first occurrence of "EF" likewise cannot be encoded and belongs to the source string; therefore, the source string is "AAAABCDEF".
  • For (7, 7, 5): the 7th character of the source string is followed by the second occurrence of "AAAAB" in the source data; the position 7 bytes before the start of that second occurrence is the starting position of "AAAAB" in the source string; and the length of the second occurrence of "AAAAB" is 5.
  • For (2, 10, 4): the (7+2)th character of the source string is followed by the second occurrence of "BCDA" in the source data, that is, "EF" in the source string immediately precedes that "BCDA"; the position 10 bytes before the start of that second occurrence is the starting position of "BCDA" in the source string; and the length of the second occurrence of "BCDA" is 4.
  • For (0, 6, 4): the (7+2+0)th character of the source string is followed by the second occurrence of "EFBC" in the source data, that is, no source string lies directly before that "EFBC"; the position 6 bytes before the start of that second occurrence is the starting position of the earlier "EFBC"; and the length of the second occurrence of "EFBC" is 4.
  • The first value in the description information can be understood as the length of the source string that is adjacent to, and located before, the compressed character string.
  • For example, the source string adjacent to the second occurrence of "BCDA" in the source data is "EF", so the first value of the description information corresponding to that "BCDA" is 2, that is, the length of "EF".
  • The second occurrence of "EFBC" in the source data has no adjacent source string, so the first value of the description information corresponding to that "EFBC" is 0.
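  • The three values can be checked by rebuilding the source data from the source string and the description information. The sketch below is an illustration under one stated assumption: the second value is treated as a backward distance, in bytes, from the start of the compressed string to the start of the earlier matching string, which is how the examples above use it. The function name is hypothetical.

```python
def rebuild_source_data(source_string: str, description_info) -> str:
    """Rebuild source data from the source string and (first, second, third)
    triples: literal count, backward distance, and match length."""
    out = []
    cursor = 0  # next unread character of the source string
    for literal_len, distance, match_len in description_info:
        # Copy the source-string characters that precede this compressed string.
        out.extend(source_string[cursor:cursor + literal_len])
        cursor += literal_len
        start = len(out) - distance
        if start < 0:
            raise ValueError("second value points before the start of the data")
        # Copy character by character so overlapping matches also work.
        for i in range(match_len):
            out.append(out[start + i])
    out.extend(source_string[cursor:])  # trailing uncompressed characters
    return "".join(out)


print(rebuild_source_data("AAAABCD", [(7, 7, 4)]))
print(rebuild_source_data("AAAABCDEF", [(7, 7, 5), (2, 10, 4), (0, 6, 4)]))
```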
  • the encoding of the source data by using a compression algorithm requires obtaining a source string and description information of the source data, and encoding the obtained source string and the description information to obtain compressed data.
  • The source string and the description information of the source data may be obtained in multiple manners, and the correspondence between the compressed string and the source string may be described by description information in various forms.
  • the embodiment of the present invention does not limit the manner in which the source string and the description information are obtained, and the specific form of the above description information.
  • An embodiment of the present invention provides a method for obtaining a source string and description information of source data, where the source string and the description information of the source data are:
  • The target character string is a character string that can be compression-encoded; the first value represents a positional relationship between the target character string and the source string; the second value represents the starting position of the target string in the source string; and the third value indicates the length of the target string.
  • a hash algorithm may be used to search for a source string in the source data and a string that can be compressed.
  • the hash algorithm is used to calculate the hash value of each character string in the source data, and the matching of each character string is determined by comparing the hash values, thereby obtaining the source string and the target field.
  • the following uses the hash algorithm in the Lz4 compression algorithm to search for the source string in the source data and the string that can be compressed as an example:
  • the source data of 4 bytes is "AAAA"
  • the ASCII code of A is 0x41
  • the number of bytes fetched from the source data each time is equal to the threshold.
  • the threshold of the compressed string is 4, that is, the minimum length of the compressed string is 4.
  • “>>19” means shifting right by 19 bits.
  • a hash algorithm is used to search for a string that can be encoded in the source data, and corresponding description information is generated, and the time overhead is small.
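  • The search described above can be sketched as follows. This is not the applicant's code: the multiplier 2654435761 is the constant used by the public LZ4 reference implementation and is an assumption here, since the hash formula itself is not reproduced in the text; the matcher keys its table by the raw 4-byte window rather than by the hash value so that the result is deterministic, whereas a real implementation would index a fixed-size table with the hash and verify the candidate. Function names are hypothetical.

```python
THRESHOLD = 4  # minimum length of a compressed string, as in the text


def lz4_style_hash(window: bytes) -> int:
    """Map 4 bytes to a 13-bit index: multiply, keep 32 bits, shift right 19."""
    value = int.from_bytes(window, "little")  # b"AAAA" -> 0x41414141
    return ((value * 2654435761) & 0xFFFFFFFF) >> 19


def find_description_info(data: bytes):
    """Single pass over `data`; returns (source string, description triples)."""
    table = {}               # 4-byte window -> most recent position
    source_string = bytearray()
    infos = []
    anchor = 0               # start of the pending uncompressed characters
    i = 0
    while i + THRESHOLD <= len(data):
        window = data[i:i + THRESHOLD]
        candidate = table.get(window)
        table[window] = i
        if candidate is None:
            i += 1
            continue
        # Extend the match beyond the first THRESHOLD bytes.
        length = THRESHOLD
        while i + length < len(data) and data[candidate + length] == data[i + length]:
            length += 1
        source_string += data[anchor:i]
        infos.append((i - anchor, i - candidate, length))
        i += length
        anchor = i
    source_string += data[anchor:]  # characters after the last match
    return bytes(source_string), infos


print(find_description_info(b"AAAABCDAAAABEFBCDAEFBC"))
```

  • On the two examples of this document, the sketch yields the same source strings and description information as stated in the text.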
  • the at least two compression algorithms may include an Lz4 compression algorithm, a data-shrinker compression algorithm, and a Lizard compression algorithm.
  • The codec device may preset the at least two compression algorithms and the byte output models corresponding to them, for example, a byte output model of the Lz4 compression algorithm, a byte output model of the data-shrinker compression algorithm, and a byte output model of the Lizard compression algorithm.
  • the above codec device may be a mobile phone, a computer, a tablet computer, and other devices capable of implementing a codec function.
  • The embodiments of the present invention do not limit the at least two compression algorithms.
  • the storage space occupied by the at least two compression algorithms for encoding the description information may be separately calculated by using the byte output models respectively corresponding to the at least two compression algorithms. Specifically, the values corresponding to the foregoing description information are respectively substituted into the byte output models corresponding to the at least two compression algorithms, and the storage space occupied by the at least two compression algorithms for encoding the description information is obtained.
  • The compression algorithm that occupies less storage space may be any compression algorithm whose storage space for encoding the description information is not the largest among the at least two compression algorithms; that compression algorithm is the target algorithm. For example, if, among a first compression algorithm to a fifth compression algorithm, the first compression algorithm needs the largest storage space to encode the description information, any compression algorithm other than the first compression algorithm may be selected as the target algorithm.
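  • A minimal sketch of the selection step follows. It substitutes each piece of description information into two byte output models and picks the smallest total, which is one permitted choice since the text only requires a non-maximum one. The assumptions are labeled in the code: the data-shrinker offset is taken to need 1 byte below 256 and 2 bytes otherwise, and lengths below the field limit need no extension bytes. All names are hypothetical.

```python
def extra(length: int, threshold: int) -> int:
    """Extension bytes for a length that does not fit in the token field."""
    return 0 if length < threshold else (length - threshold) // 255 + 1


def data_shrinker_size(lit: int, off: int, match: int) -> int:
    offset_bytes = 1 if off < 256 else 2  # assumption, not stated in the text
    return 1 + extra(lit, 7) + offset_bytes + extra(match, 15) + lit


def lizard_size(lit: int, off: int, match: int) -> int:
    offset_bytes = 1 if off <= 1023 else 2 if off <= 65535 else 3
    lit_threshold = 3 if offset_bytes == 1 else 7
    return 1 + extra(lit, lit_threshold) + offset_bytes + extra(match, 7) + lit


MODELS = {"data-shrinker": data_shrinker_size, "Lizard": lizard_size}


def select_target_algorithm(description_info):
    """Return (name of the cheapest model, per-model byte totals)."""
    totals = {
        name: sum(model(lit, off, match) for lit, off, match in description_info)
        for name, model in MODELS.items()
    }
    target = min(totals, key=totals.get)  # ties go to the first model listed
    return target, totals


print(select_target_algorithm([(5, 100, 4)]))
print(select_target_algorithm([(2, 500, 4)]))
```

  • No compression is performed at this stage; only the byte counts are compared.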
  • Compressing and encoding the source data by using the target algorithm to obtain the compressed data may be: compressing and encoding the source string and the description information in the encoding manner corresponding to the target algorithm, to obtain the compressed data corresponding to the source data.
  • The source data needs to be traversed only once to obtain and store the source string and the description information; after the target algorithm is determined, the source string and the description information are encoded in the encoding manner of the target algorithm to obtain the compressed data corresponding to the source data. Once the source data to be compressed and the threshold are determined, the source data corresponds to exactly one source string and one set of description information.
  • Different compression algorithms that encode the same data to be compressed rely on the same description information and source string, and each compression algorithm can perform compression coding according to the description information and the source string to obtain the corresponding compressed data. That is to say, in the embodiment of the present invention, the source data needs to be traversed only once to obtain the description information and the source string.
  • The compression ratio is therefore clearly improved.
  • Only the storage space required by the at least two compression algorithms to encode the description information is calculated; no compression operation is performed with any compression algorithm other than the target algorithm, so the coding overhead is small.
  • the compressed data includes an indication field, where the indication field indicates the target algorithm.
  • the above indication field may occupy at least one bit to indicate the above target algorithm. It can be understood that the binary sequence corresponding to the above indication field is different, and the indicated compression algorithm is different.
  • For example, the codec device is preset with four compression algorithms, that is, compression coding can be performed with any one of the four compression algorithms, and the indication field occupies two bits: 00 indicates the first compression algorithm, 01 indicates the second compression algorithm, 10 indicates the third compression algorithm, and 11 indicates the fourth compression algorithm. If the target algorithm is the fourth compression algorithm, the indication field is 11.
  • As another example, the codec device is preset with 8 compression algorithms, and the indication field occupies 3 bits: 000 indicates the first compression algorithm, 001 the second, 010 the third, 011 the fourth, 100 the fifth, 101 the sixth, 110 the seventh, and 111 the eighth. If the target algorithm is the fourth compression algorithm, the indication field is 011.
  • The indication field indicates the compression algorithm used to encode the source data, so that when the compressed data is decompressed, the decompression algorithm corresponding to that compression algorithm is used directly, which improves decompression efficiency.
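  • The mapping between preset algorithms and indication-field bits in the two examples can be sketched as follows. The function names are hypothetical, and the zero-based index (0 for the first compression algorithm) is an illustrative convention, not something the text prescribes.

```python
def indication_field(algorithm_index: int, algorithm_count: int) -> str:
    """Bit string that identifies one of `algorithm_count` preset algorithms.

    `algorithm_index` is zero-based: 0 is the first compression algorithm.
    """
    if not 0 <= algorithm_index < algorithm_count:
        raise ValueError("algorithm index out of range")
    # Smallest width that can distinguish all preset algorithms (at least 1 bit).
    width = max(1, (algorithm_count - 1).bit_length())
    return format(algorithm_index, f"0{width}b")


def parse_indication_field(bits: str) -> int:
    """Inverse mapping used on the decompression side."""
    return int(bits, 2)


print(indication_field(3, 4))   # fourth of four algorithms
print(indication_field(3, 8))   # fourth of eight algorithms
```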
  • The description information includes a target field, where the target field is used to describe a correspondence between the target string and the source string, and the target string belongs to the compressed string. The target field includes a first value, a second value, and a third value: the first value represents a positional relationship between the target character string and the source string; the second value represents a starting position of the target string in the source string; and the third value indicates the length of the target character string.
  • the above target string can be obtained by the above target field and the above source string. Therefore, the above target field can be replaced with the above target string.
  • the first numerical value indicates a positional relationship between the target character string and the source character string, and the position of the target character string in the source data can be determined by the first numerical value.
  • The target character string can be determined by the second value and the third value. Specifically, starting from the starting position indicated by the second value, a number of characters equal to the third value is taken to obtain the target character string.
  • For example, the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4; then the source string is "AAAABCDEF", the target string is the second occurrence of "AAAAB" in the source data, and the target field is (7, 7, 5).
  • That is, the target string follows the 7th character of the source string; the position 7 bytes before the start of the target string is the starting position of the target string in the source string; and the target string has a length of 5.
  • the first numerical value can also be understood as the length of the source character string adjacent to the target character string, that is, the length of the uncoded character string adjacent to the front of the target character string.
  • the mapping between the target string and the source string may be described in other forms of the target field, which is not limited in the embodiment of the present invention.
  • the correspondence between the target character string and the source character string is described by using the target field, so that the target character string is accurately and quickly determined by using the target field and the source string, and the coding efficiency is high.
  • the storage method of the source string and the description information can be implemented in any of the following ways:
  • Manner 1: store the source string and the description information separately.
  • For example, the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4; then the source string is "AAAABCDEF", the compressed strings are the second occurrences of "AAAAB", "BCDA", and "EFBC" in the source data, and the three pieces of description information corresponding to the three compressed strings are (7, 7, 5), (2, 10, 4), and (0, 6, 4). The source string "AAAABCDEF" and the three pieces of description information (7, 7, 5), (2, 10, 4), (0, 6, 4) are stored separately.
  • The pieces of description information may be stored sequentially in the order in which they are generated. It can be understood that, in this manner, the source string and the description information are stored in different storage spaces.
  • In this manner, one implementation of compressing and encoding the source data is as follows: obtain the target field from the description information; obtain a first character string from the source string, where the first character string is the character string in the source data that is adjacent to the target character string and located before it; and use the target algorithm to generate the compressed data of a first data segment according to the first value, the second value, the third value, and the first character string, where the first data segment includes the first character string and the target character string.
  • Obtaining the first character string from the source string may be: extracting, starting from the first unextracted character in the source string, a number of characters equal to the first value, to obtain the first character string.
  • For example, the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4; then the source string is "AAAABCDEF" and the description information is (7, 7, 5), (2, 10, 4), (0, 6, 4).
  • Extract 7 characters starting from the first unextracted character in the source string to obtain "AAAABCD", and encode "AAAABCD" and (7, 7, 5) to obtain the compressed data of "AAAABCDAAAAB".
  • Extract 2 characters starting from the first unextracted character in the source string to obtain "EF", and encode "EF" and (2, 10, 4) to obtain the compressed data of "EFBCDA".
  • Extract 0 characters starting from the first unextracted character in the source string, which yields no string, and encode (0, 6, 4) to obtain the compressed data of "EFBC".
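  • The extraction steps above amount to walking a cursor through the source string, advancing it by the first value of each piece of description information. A minimal sketch, with a hypothetical function name:

```python
def pair_strings_with_fields(source_string: str, description_info):
    """Pair each target field with the source-string characters that precede it."""
    pairs = []
    cursor = 0  # index of the first unextracted character
    for field in description_info:
        first_value = field[0]
        pairs.append((source_string[cursor:cursor + first_value], field))
        cursor += first_value
    return pairs


for first_string, field in pair_strings_with_fields(
        "AAAABCDEF", [(7, 7, 5), (2, 10, 4), (0, 6, 4)]):
    print(repr(first_string), field)
```

  • Each resulting pair is the input from which one code stream unit is encoded.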
  • FIG. 7 is a schematic structural diagram of the code stream unit obtained by encoding the source string "AAAABCD" and the description information (7, 7, 5) using the Lz4 compression algorithm, corresponding to the compressed data of "AAAABCDAAAAB"; 701 contains the first value and the third value of the description information, 702 is the source string, and 703 is the second value of the description information.
  • FIG. 8 is a schematic structural diagram of the code stream unit obtained by encoding the source string "EF" and the description information (2, 10, 4) using the Lz4 compression algorithm, corresponding to the compressed data of "EFBCDA"; 801 contains the first value and the third value of the description information, 802 is the source string, and 803 is the second value of the description information.
  • FIG. 9 is a schematic structural diagram of the code stream unit obtained by encoding the description information (0, 6, 4) using the Lz4 compression algorithm, corresponding to the compressed data of "EFBC"; 901 contains the first value and the third value of the description information, and 902 is the second value of the description information.
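  • For illustration, the three units can be encoded following the publicly documented LZ4 block format. This is an assumption about layout, not a reproduction of FIG. 7 to FIG. 9: in that format the token's high 4 bits hold the character length, the low 4 bits hold the matching length minus 4, and the offset is 2 bytes in little-endian order. The function names are hypothetical.

```python
MIN_MATCH = 4  # the LZ4 block format stores (matching length - 4)


def _length_extension(remainder: int) -> bytes:
    """Bytes of 255 followed by a final byte below 255, summing to `remainder`."""
    out = bytearray()
    while remainder >= 255:
        out.append(255)
        remainder -= 255
    out.append(remainder)
    return bytes(out)


def encode_lz4_unit(literals: bytes, offset: int, match_len: int) -> bytes:
    """Encode one unit: token, optional length bytes, literals, offset."""
    if match_len < MIN_MATCH or not 0 < offset <= 0xFFFF:
        raise ValueError("unsupported matching length or offset")
    lit_len = len(literals)
    stored_match = match_len - MIN_MATCH
    token = (min(lit_len, 15) << 4) | min(stored_match, 15)
    unit = bytearray([token])
    if lit_len >= 15:
        unit += _length_extension(lit_len - 15)
    unit += literals
    unit += offset.to_bytes(2, "little")
    if stored_match >= 15:
        unit += _length_extension(stored_match - 15)
    return bytes(unit)


print(encode_lz4_unit(b"AAAABCD", 7, 5).hex())
print(encode_lz4_unit(b"EF", 10, 4).hex())
print(encode_lz4_unit(b"", 6, 4).hex())
```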
  • each description information in the stored description information may be sequentially encoded to obtain compressed data of the source data.
  • the compressed data of the first data segment can be quickly generated by using the target field and the source string, which is simple to implement and has high coding efficiency.
  • Manner 2: store target information; the target information includes the target field and a second character string; the second character string is the character string in the source data that is adjacent to the target character string and located before it.
  • the storage target information may be obtained by combining the target field and the second character string to obtain the target information, and storing the target information. Assuming the target field is (7, 7, 5) and the second string is "AAAABCD”, the target information is ("AAAABCD", 7, 7, 5). Assuming the target field is (2, 10, 4) and the second string is "EF”, the target information is ("EF", 2, 10, 4). It can be understood that, in this manner, the source string and the above description information are stored as one piece of data in the same storage space.
  • In this manner, one implementation of compressing and encoding the source data is as follows: acquire the target information; and use the target algorithm to generate the compressed data of a second data segment according to the first value, the second value, the third value, and the second character string; the second data segment includes the second character string and the target character string.
  • For example, the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4; then the source string is "AAAABCDEF" and the description information is (7, 7, 5), (2, 10, 4), (0, 6, 4).
  • The stored target information includes ("AAAABCD", 7, 7, 5), ("EF", 2, 10, 4), and ("", 0, 6, 4), where ("", 0, 6, 4) indicates that "EFBC" has no adjacent source string, that is, "EFBC" is a string that needs to be compressed.
  • If the target algorithm is the Lz4 compression algorithm and the target information is ("AAAABCD", 7, 7, 5), encoding the target information with the target algorithm yields the code stream unit shown in FIG. 7, corresponding to the compressed data of "AAAABCDAAAAB".
  • the compressed data of the second data segment can be quickly generated by using the target field, which is simple to implement and can save coding time.
  • the compressed data may be decompressed by analyzing the compressed data to obtain the target algorithm indicated by the indication field, and decompressing the compressed data by using a decompression algorithm corresponding to the target algorithm to obtain the source data.
  • Analyzing the compressed data to obtain the target algorithm indicated by the indication field may be: parsing the indication field and determining the target algorithm according to the indication field.
  • the above codec device presets a correspondence between the indication field and the compression algorithm. After parsing the indication field, the codec device may determine, according to the correspondence between the indication field and the compression algorithm, a compression algorithm used to encode the source data, that is, the target algorithm.
  • the decompression algorithm used to decompress the compressed data is determined by parsing the indication field of the compressed data, and the decompression operation can be completed accurately and quickly.
  • An embodiment of the present invention provides a specific example of a codec method, as shown in FIG. 10, including:
  • the above source data corresponds to at least one description information.
  • multiple pieces of description information corresponding to the source data may be sequentially acquired.
  • the specific implementation is the same as that in FIG. 6.
  • the obtained description information and the source string may be stored in the order in which they are obtained.
  • the specific implementation is the same as step 603 in FIG. 6.
  • for step 604 in FIG. 10: in the embodiment of the present invention, before the source data is compression-encoded, the storage space required by each of at least two compression algorithms to encode the description information is calculated, and the compression algorithm that requires less storage space is selected as the target algorithm; after the target algorithm is selected, it is used to encode the source data to obtain compressed data.
  • because the compression algorithm that requires less storage space is selected as the target algorithm before the source data is compression-encoded, the compression ratio is markedly improved without significantly increasing the encoding time.
  • FIG. 11 is a functional block diagram of a codec apparatus according to an embodiment of the present invention.
  • the functional blocks of the codec device may implement the inventive arrangements by hardware, software or a combination of hardware and software.
  • the functional blocks depicted in Figure 11 can be combined or separated into several sub-blocks to implement the inventive arrangements. Accordingly, the above description of the invention may support any possible combination or separation or further definition of the functional modules described below.
  • the codec may include:
  • the acquiring unit 1101 is configured to acquire source data to be compressed.
  • a determining unit 1102, configured to determine a source string and description information according to the source data to be compressed; the source string is the string in the source data that is not compressed, and the description information describes the correspondence between the compressed strings and the source string;
  • the calculating unit 1103 is configured to separately calculate the storage space required by each of at least two compression algorithms to encode the description information;
  • the selecting unit 1104 is configured to select, as the target algorithm, the compression algorithm among the at least two compression algorithms that requires less storage space to encode the description information.
  • the encoding unit 1105 is configured to compress and encode the source data by using the target algorithm to obtain compressed data.
  • the target algorithm that requires less storage space to encode the source data is selected before the source data is compression-encoded, and the target algorithm is then used to compress and encode the source data; the compression ratio is markedly increased and the occupied storage space is reduced without significantly increasing the encoding time.
  • the compressed data includes an indication field, where the indication field indicates the target algorithm.
  • the compression algorithm used to encode the source data is indicated by the indication field, so that when the compressed data obtained by compression coding the source data is decompressed, the decompression algorithm corresponding to the compression algorithm is used for decompression, and the decompression efficiency is improved.
  • the description information includes a target field, which describes the correspondence between a target string and the source string, where the target string is one of the compressed strings; the target field includes a first value, a second value, and a third value; the first value represents the positional relationship between the target string and the source string; the second value represents the starting position of the target string in the source string; and the third value represents the length of the target string.
  • the correspondence between the target string and the source string is described by the target field, so that the target string can be determined accurately from the target field and the source string, and the coding efficiency is high.
  • the foregoing codec device further includes:
  • a first storage unit 1106, configured to separately store the source string and the foregoing description information
  • the encoding unit 1105 is specifically configured to obtain the target field from the description information and obtain a first string from the source string, where the first string is the string in the source data that is adjacent to, and located before, the target string; and to generate, by using the target algorithm, compressed data of a first data segment according to the first value, the second value, the third value, and the first string, where the first data segment contains the first string and the target string.
  • the compressed data of the first data segment can be quickly generated by using the target field and the source string, which is simple to implement and has high coding efficiency.
  • the foregoing codec device further includes:
  • a second storage unit 1107, configured to store target information;
  • the target information includes the target field and a second string;
  • the second string is the string in the source data that is adjacent to, and located before, the target string;
  • the encoding unit 1105 is specifically configured to acquire the target information, and to generate, by using the target algorithm, compressed data of a second data segment according to the first value, the second value, the third value, and the second string;
  • the second data segment contains the second string and the target string.
  • the compressed data of the second data segment can be quickly generated by using the target field, which is simple to implement and can save coding time.
  • the foregoing codec device further includes:
  • the parsing unit 1108 is configured to parse the compressed data to obtain the target algorithm indicated by the indication field;
  • the decoding unit 1109 is configured to decompress the compressed data by using a decompression algorithm corresponding to the target algorithm to obtain the source data.
  • the decompression algorithm used to decompress the compressed data is determined by parsing the indication field of the compressed data, and the decompression operation can be completed accurately and quickly.
  • the acquiring unit 1101 is specifically configured to acquire, in their order of appearance, the strings that occur for the first time in the source data, to obtain the source string; to search the source data, by using a hash algorithm, for a string that matches the target string; when a reference string matching the target string is found, to determine that the target string is a string that can be compression-encoded; and to generate the target field.
  • a hash algorithm is used to search the source data for strings that can be encoded and to generate the corresponding description information, so the time overhead is small.
  • the codec device of the embodiment of the present invention may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD); the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or generic array logic (GAL).
  • the codec device and each module thereof may also be a software module.
  • the codec apparatus may correspond to performing the method described in the embodiments of the present invention, and the above-described and other operations and/or functions of the respective units in the codec apparatus are respectively implemented to implement the respective processes of the respective methods of FIG. For the sake of brevity, we will not repeat them here.
  • the codec device of the embodiment of the present invention selects a target algorithm that requires less storage space for encoding the source data before compressing and encoding the source data; and compresses and encodes the source data by using the target algorithm; Under the condition that the coding time is not significantly increased, the compression ratio is obviously increased, and the occupied storage space is reduced.
  • FIG. 12 is a schematic block diagram of a codec device according to another embodiment of the present invention.
  • the codec device in this embodiment may include: one or more processors 1201; one or more input devices 1202 and a memory 1203.
  • the processor 1201, the input device 1202, and the memory 1203 are connected by a bus 1204.
  • the memory 1203 is used to store a computer program, the computer program includes program instructions, and the processor 1201 is configured to execute program instructions stored in the memory 1203.
  • Input device 1202 is for inputting a compression command.
  • the processor 1201 is configured to invoke the program instructions to: obtain source data to be compressed, and determine a source string and description information according to the source data to be compressed, where the source string is the string in the source data that is not compressed and the description information describes the correspondence between the compressed strings and the source string; separately calculate the storage space required by each of at least two compression algorithms to encode the description information; select, as the target algorithm, the compression algorithm among the at least two compression algorithms that requires less storage space to encode the description information; and compression-encode the source data by using the target algorithm to obtain compressed data.
  • the processor 1201 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or any conventional processor.
  • the processor 1201 described above can implement the functions of the obtaining unit 1101, the determining unit 1102, the calculating unit 1103, the selecting unit 1104, the encoding unit 1105, the parsing unit 1108, and the decoding unit 1109 as shown in FIG.
  • the memory 1203 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and can be used to store related instructions and data.
  • the processor 1201, the input device 1202, and the memory 1203 described in the embodiments of the present invention may execute the implementations described in the codec method provided by the embodiments of the present invention, as well as the implementation of the codec apparatus; details are not repeated here.
  • the codec device may correspond to the device for implementing codec shown in FIG. 11 in the embodiment of the present invention, and may correspond to the implementation of the codec method in FIG. 6 according to the embodiment of the present invention.
  • the above-mentioned and other operations and/or functions of the respective modules in the codec device are respectively implemented in order to implement the corresponding processes of the respective methods in FIG. 6.
  • no further details are provided herein.
  • the codec device of the embodiment of the present invention selects a target algorithm that requires less storage space for encoding the source data before compressing and encoding the source data, and compresses and encodes the source data by using the target algorithm; Under the condition that the coding time is not significantly increased, the compression ratio is obviously increased, and the occupied storage space is reduced.
  • a computer-readable storage medium stores a computer program that includes program instructions; when executed by a processor, the program instructions cause the processor to: acquire source data to be compressed, and determine a source string and description information according to the source data to be compressed, where the source string is the string in the source data that is not compressed and the description information describes the correspondence between the compressed strings and the source string; separately calculate the storage space required by each of at least two compression algorithms to encode the description information; select, as the target algorithm, the compression algorithm among the at least two compression algorithms that requires less storage space to encode the description information; and compression-encode the source data by using the target algorithm to obtain compressed data.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, wireless, microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains one or more sets of available media.
  • the usable medium can be a magnetic medium (eg, a floppy disk, a hard disk, a magnetic tape), an optical medium (eg, a DVD), or a semiconductor medium.
  • the semiconductor medium can be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

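This application discloses a codec method, a codec apparatus, and a codec device. The method includes: acquiring source data to be compressed, and determining a source string and description information according to the source data to be compressed, where the source string is the string in the source data that is not compressed and the description information describes the correspondence between the compressed strings and the source string; separately calculating the storage space required by each of at least two compression algorithms to encode the description information; selecting, as the target algorithm, the compression algorithm among the at least two compression algorithms that requires less storage space to encode the description information; and compression-encoding the source data by using the target algorithm to obtain compressed data. Adaptively selecting, during compression encoding, the compression algorithm that requires less storage space improves the compression ratio.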

Description

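Codec method, apparatus, and codec device
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a codec method, a codec apparatus, and a codec device.
Background
Data compression is a technique that, without losing information, reduces the amount of data in order to save storage space and improve the efficiency of transmission, storage, and processing; equivalently, it reorganizes data according to some algorithm to reduce redundancy and storage space. Data compression includes lossy compression and lossless compression. Lossless compression restores the pre-compression data exactly; its encoding overhead is smaller than that of lossy compression, and it is generally used to compress text regions of a desktop.
In the field of lossless compression, the Lz77 dictionary coding algorithm, introduced in 1977, was a milestone. Lz77 coding is an open-source, lossless dictionary compression algorithm. Today Lz77 is widely used across data compression, and the many compression algorithms derived from it all belong to the Lz77 family. Lz77 derivatives (such as Lzss, Lzo, and Lz4) and combined algorithms (zlib, Lzma, zstd) are widely used in data storage, bandwidth-constrained transmission, and similar areas. With the rapid growth of the Internet and the Internet of Things, data files keep getting larger, and there is a growing need to raise the compression ratio so as to reduce the storage space data occupies and the time needed to transmit it.
At present, the data to be compressed is compression-encoded with a preset compression algorithm. For example, a codec apparatus is preset to use the Lz4 compression algorithm. Different compression algorithms use different encoding rules, so encoding the same data with different algorithms requires different amounts of storage space; that is, the resulting compression ratios differ. In the current technical solution, a preset compression algorithm can hardly guarantee a good compression ratio, that is, a small storage footprint, for every piece of data to be compressed.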
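Summary
This application provides a codec method that makes it possible to obtain a good compression ratio for any data to be compressed and to reduce the storage space occupied.
In a first aspect, this application provides a codec method, including:
acquiring source data to be compressed; determining a source string and description information according to the source data to be compressed, where the source string is the string in the source data that is not compressed and the description information describes the correspondence between the compressed strings and the source string; selecting, as the target algorithm, the preset compression algorithm that requires less storage space to encode the description information; and compression-encoding the source string and the description information with the target algorithm to obtain compressed data.
The strings contained in the source data to be compressed fall into two kinds. The first kind is strings that can be compressed, that is, strings that are not occurring for the first time in the source data and need to be compressed. The second kind is strings that cannot be compressed, that is, strings that occur for the first time in the source data and do not need to be compressed. A compressible string may be a string that is not occurring for the first time in the source data and whose length reaches a threshold; the threshold may be 3, 4, 5, 6, and so on. The length of a string is the number of characters it contains. The threshold is the minimum number of characters a compressed string contains, that is, the minimum length of a compressed string. A string that cannot be compressed may be a first-occurrence string and/or a string shorter than the threshold.
In this application, the compression algorithm that requires less storage space to encode the description information is selected as the target algorithm, and the source data is compression-encoded with the target algorithm to obtain compressed data. The compression algorithm that requires less storage space is thus selected adaptively during compression encoding, which reduces the storage space required and improves the compression ratio.
In an optional implementation, the compressed data obtained by compression-encoding the source data contains an indication field that indicates the target algorithm. In other words, the indication field indicates the compression algorithm used to encode the source data.
Because the indication field indicates the compression algorithm used to encode the source data, the compressed data can later be decompressed with the decompression algorithm corresponding to that compression algorithm, which improves decompression efficiency.
In another optional implementation, the description information contains a target field. The target field describes the correspondence between a target string and the source string, where the target string is one of the compressed strings. The target field contains a first value, a second value, and a third value: the first value represents the positional relationship between the target string and the source string; the second value represents the starting position of the target string in the source string; the third value represents the length of the target string.
It can be understood that the description information may contain one or more fields, each describing the correspondence between one compressed string and the source string. The source string is the string in the source data that cannot be compression-encoded; it may consist of strings that occur for the first time in the source data and/or strings shorter than the threshold. The part of the source data that precedes the target string contains the target string, that is, the target string is not occurring for the first time in the source data. Optionally, the length of the target string reaches the threshold. The basic idea of one class of dictionary coding is to check whether the string being compressed has already appeared in the previously input data; if so, the repeated string is replaced with description information related to the earlier occurrence. For example, the Lz4 lossless compression algorithm uses an "offset, match length" pair, <offset,length>, to replace a string that has appeared before, and then represents the pair in the code stream in a specific unambiguous form. Suppose Lz4 is used with a threshold of 4 and the source data to be compressed is AAAABCDAAAA. The repeated string is AAAA, and its offset and match length are <7,4>. The source data can then be written as AAAABCD<7,4>, where the leading "AAAABCD" is the source string and <7,4> can serve as the description information; "AAAA" can be recovered from the description information and the source string.
The basic idea of the compression encoding provided in this application may be to replace repeated strings with description information related to the source string in the source data. The target field describes the correspondence between the target string and the source string, and the target string can be decoded from the source string and the target field.
In this application, the target field describes the correspondence between the target string and the source string, so that the target string can be determined accurately from the target field and the source string, and the coding efficiency is high.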
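In another optional implementation, after the source string and the description information are determined, the source data may be compression-encoded as follows:
storing the source string and the description information separately; obtaining the target field from the description information; obtaining a first string from the source string, where the first string is the string in the source data that is adjacent to, and located before, the target string; and generating, with the target algorithm, compressed data of a first data segment according to the first value, the second value, the third value, and the first string, where the first data segment contains the first string and the target string.
The description information may contain the description information of two or more compressed strings in the source data, one piece per compressed string. In other words, the description information contains the description information of at least one compressed string.
In this application, the compressed data of the first data segment can be generated quickly from the target field and the source string, which is simple to implement and efficient to encode.
In another optional implementation, after the source string and the description information are determined, the source data may be compression-encoded as follows:
storing the target field and a second string as a single piece of information to obtain target information, where the second string is the string in the source data that is adjacent to, and located before, the target string; acquiring the target information; and generating, with the target algorithm, compressed data of a second data segment according to the first value, the second value, the third value, and the second string, where the second data segment contains the second string and the target string.
In this application, the compressed data of the second data segment can be generated quickly from the target field, which is simple to implement and saves encoding time.
In another optional implementation, after the source data is compression-encoded with the target algorithm to obtain the compressed data, the method further includes:
parsing the compressed data to obtain the target algorithm indicated by the indication field;
decompressing the compressed data with the decompression algorithm corresponding to the target algorithm to obtain the source data.
By parsing the indication field of the compressed data, this application determines the decompression algorithm needed to decompress it, so decompression can be completed accurately and quickly.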
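The role of the indication field can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the one-byte field values in `CODECS` are hypothetical (the text does not specify an encoding for the indication field), and Python's standard-library `zlib`, `bz2`, and `lzma` codecs stand in for Lz4, data-shrinker, and Lizard, which the standard library does not provide.

```python
import bz2
import lzma
import zlib

# Hypothetical indication-field values mapped to (name, compress, decompress).
# The stdlib codecs are stand-ins for the algorithms named in the text.
CODECS = {
    0x01: ("zlib", zlib.compress, zlib.decompress),
    0x02: ("bz2", bz2.compress, bz2.decompress),
    0x03: ("lzma", lzma.compress, lzma.decompress),
}


def pack(indication: int, source: bytes) -> bytes:
    """Compress with the codec named by `indication` and prepend the indication field."""
    _, compress, _ = CODECS[indication]
    return bytes([indication]) + compress(source)


def unpack(compressed: bytes) -> bytes:
    """Parse the indication field, then decompress with the matching algorithm."""
    indication, payload = compressed[0], compressed[1:]
    if indication not in CODECS:
        raise ValueError(f"unknown indication field: {indication:#04x}")
    _, _, decompress = CODECS[indication]
    return decompress(payload)
```

The decoder never has to guess or try several algorithms: a single table lookup on the first byte selects the decompression routine.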
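In another optional implementation, the source string and the target field are determined as follows:
acquiring, in their order of appearance, the strings that occur for the first time in the source data, to obtain the source string; searching the source data, with a hash algorithm, for a string that matches the target string; when a string matching the target string is found, determining that the target string is a string that can be compression-encoded; and obtaining the first value, the second value, and the third value to generate the target field.
This application uses a hash algorithm to search the source data for strings that can be encoded and to generate the corresponding description information, so the time overhead is small.
In a second aspect, this application provides a codec apparatus, including:
an acquiring unit, configured to acquire source data to be compressed;
a determining unit, configured to determine a source string and description information according to the source data to be compressed, where the source string is the string in the source data that is not compressed and the description information describes the correspondence between the compressed strings and the source string;
a calculating unit, configured to separately calculate the storage space required by each of at least two compression algorithms to encode the description information;
a selecting unit, configured to select, as the target algorithm, the compression algorithm among the at least two compression algorithms that requires less storage space to encode the description information;
an encoding unit, configured to compression-encode the source data with the target algorithm to obtain compressed data.
Before the source data is compression-encoded, this application selects the target algorithm that requires less storage space to encode the source data, and then compression-encodes the source data with it. This markedly improves the compression ratio and reduces the storage space occupied without significantly increasing the encoding time.
In an optional implementation, the compressed data contains an indication field that indicates the target algorithm.
Because the indication field indicates the compression algorithm used to encode the source data, the compressed data can later be decompressed with the decompression algorithm corresponding to that compression algorithm, which improves decompression efficiency.
In another optional implementation, the description information contains a target field. The target field describes the correspondence between a target string and the source string, where the target string is one of the compressed strings. The target field contains a first value, a second value, and a third value: the first value represents the positional relationship between the target string and the source string; the second value represents the starting position of the target string in the source string; the third value represents the length of the target string.
In this application, the target field describes the correspondence between the target string and the source string, so that the target string can be determined accurately from the target field and the source string, and the coding efficiency is high.
In another optional implementation, the codec apparatus further includes:
a first storage unit, configured to store the source string and the description information separately;
the encoding unit is specifically configured to obtain the target field from the description information; obtain a first string from the source string, where the first string is the string in the source data that is adjacent to, and located before, the target string; and generate, with the target algorithm, compressed data of a first data segment according to the first value, the second value, the third value, and the first string, where the first data segment contains the first string and the target string.
In this application, the compressed data of the first data segment can be generated quickly from the target field and the source string, which is simple to implement and efficient to encode.
In another optional implementation, the codec apparatus further includes:
a second storage unit, configured to store the target field and a second string as a single piece of information to obtain target information, where the second string is the string in the source data that is adjacent to, and located before, the target string;
the encoding unit is specifically configured to acquire the target information, and generate, with the target algorithm, compressed data of a second data segment according to the first value, the second value, the third value, and the second string, where the second data segment contains the second string and the target string.
In this application, the compressed data of the second data segment can be generated quickly from the target field, which is simple to implement and saves encoding time.
In another optional implementation, the codec apparatus further includes:
a parsing unit, configured to parse the compressed data to obtain the target algorithm indicated by the indication field;
a decoding unit, configured to decompress the compressed data with the decompression algorithm corresponding to the target algorithm to obtain the source data.
By parsing the indication field of the compressed data, this application determines the decompression algorithm needed to decompress it, so decompression can be completed accurately and quickly.
In another optional implementation, the acquiring unit is specifically configured to acquire, in their order of appearance, the strings that occur for the first time in the source data, to obtain the source string; search the source data, with a hash algorithm, for a string that matches the target string; when a reference string matching the target string is found, determine that the target string is a string that can be compression-encoded; and generate the target field.
This application uses a hash algorithm to search the source data for strings that can be encoded and to generate the corresponding description information, so the time overhead is small.
In a third aspect, this application provides a codec device including a processor and a memory that are connected to each other, where the memory is configured to store a computer program that includes program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect or of any optional implementation of the first aspect.
In this application, the codec device can select, from multiple preset compression algorithms, the one that requires less storage space to encode the source data and compression-encode the source data with it, which reduces the storage space required and improves the compression ratio.
In a fourth aspect, this application provides a computer-readable storage medium that stores a computer program including program instructions; when executed by a processor, the program instructions cause the processor to perform the method of the first aspect or of any optional implementation of the first aspect.
In this application, the compression algorithm that requires less storage space to encode the source data can be selected adaptively during compression encoding, which reduces the storage space occupied and improves the compression ratio.
On the basis of the implementations provided in the foregoing aspects, this application may further combine them to provide more implementations.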
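Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the code stream unit used by the Lz4 compression algorithm;
FIG. 2 is a schematic structural diagram of a code stream unit generated with the Lz4 compression algorithm;
FIG. 3 is a schematic structural diagram of a code stream unit generated with the data-shrinker compression algorithm;
FIG. 4 is a schematic structural diagram of a code stream unit generated with the Lizard compression algorithm;
FIG. 5 is a schematic flowchart of encoding in the prior art;
FIG. 6 is a schematic flowchart of a codec method provided in this application;
FIG. 7 is a schematic structural diagram of another code stream unit generated with the Lz4 compression algorithm;
FIG. 8 is a schematic structural diagram of yet another code stream unit generated with the Lz4 compression algorithm;
FIG. 9 is a schematic structural diagram of yet another code stream unit generated with the Lz4 compression algorithm;
FIG. 10 is a schematic flowchart of another codec method provided in this application;
FIG. 11 is a schematic structural diagram of a codec apparatus provided in this application;
FIG. 12 is a schematic structural diagram of a codec device provided in this application.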
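Detailed Description
To better introduce the embodiments of the present invention, the implementations of the conventional compression algorithms related to the present invention are described first:
(I) Lz4 lossless compression algorithm
The Lz4 lossless compression algorithm is derived from the Lz77 compression algorithm, and its compression principle is exactly the same: an "offset, match length" pair, <offset,length>, replaces symbols that have appeared before and is then represented in a specific unambiguous form. For example, the data to be compressed AAAABCDAAAA is represented as AAAABCD<7,4>, where <7,4> replaces the trailing "AAAA". AAAABCD<7,4> can be understood as one representation of the encoding information for the data to be compressed.
To compression-encode source data with the Lz4 lossless compression algorithm, the offset, the match length, the literal length, and the source string must first be obtained; they are then written into the code stream in a specific form. Writing in a specific form means encoding the obtained offset, match length, literal length, and source string according to the Lz4 encoding rules to obtain a code stream unit. The code stream units obtained one after another form the code stream. While a compression algorithm encodes the source data, it generates multiple code stream units in sequence, each corresponding to the compressed data of one data segment of the source data. In other words, the source data consists of multiple data segments, and compression-encoding the source data is the process of compression-encoding each data segment to obtain its code stream unit.
The offset indicates the starting position of the match data, that is, the starting position of the compressed string in the source string. The match data is the string in the source string that is identical to the compressed string. For example, suppose the source data is "AAAABCDAAAA". The last four characters, "AAAA", are the compressed string, and the "AAAA" in the first to fourth bytes is the match data. The offset is 7 (Offset = 7), which means that the match data starts 7 bytes before the start of the compressed string. The match length is the length of the match data, which is also the length of the compressed string; since the match data is "AAAA", the match length is 4 (Match length = 4), that is, 4 bytes. The literal length is the length of the string that immediately precedes the compressed string and cannot be compression-encoded; here it is 7 (Literal length = 7), which means that the 7 bytes immediately before the compressed string, "AAAABCD", cannot be encoded. The source string is the unencoded string in the source data; here it is "AAAABCD" (Literals = AAAABCD). In this application, the match length may be written as Match length and the literal length as Literal length.
FIG. 1 is a schematic structural diagram of the code stream unit used by the Lz4 compression algorithm. As shown in FIG. 1, the first field 101 contains all or part of the match length and the literal length (Literal length and Match length) and occupies one byte. The second field 102 holds the part of the literal length that exceeds 15 bytes; if the literal length does not exceed 15 bytes, the second field 102 is absent. The third field 103 is the unencoded string, that is, the source string. The fourth field 104 occupies 2 bytes and records the offset. The fifth field 105 holds the part of the match length that exceeds 15 bytes; if the match length does not exceed 15 bytes, the fifth field 105 is absent. The first field may be called the "Token" control byte. If neither the literal length nor the match length exceeds 15 bytes, the first field 101 contains both lengths in full; otherwise, the first field 101 contains only part of them and the remainder is carried in the second field and the fifth field respectively. The specific rules for generating the code stream are as follows:
1. Each time an "offset, match length" pair <offset,length> is generated, a new code stream unit is created. The unit begins with the first field, the "Token" control byte, which contains part or all of the match length and the literal length (Literal length and Match length).
1.1. If the literal length is less than 15 (Literal length < 15), it is written into the high 4 bits of the first field and no further bytes are used to represent it; otherwise, the part of the literal length that exceeds 15 bytes is represented in the second field 102 with prefix coding.
1.2. If the match length is less than 15 (Match length < 15), it is written into the low 4 bits of the first field and no further bytes are used to represent it; otherwise, the part of the match length that exceeds 15 bytes is represented in the fifth field 105 with prefix coding.
2. The second field 102 and the fifth field 105 use the same prefix coding: for the part above 15, one 0xFF byte is added for every full 255, until the remainder is less than 255; the remainder is then written and the field ends.
3. The third field 103 receives the unencoded string as is.
It can be understood that writing the source string of the source data into the code stream unit yields the third field 103. In other words, the third field 103 stores the unencoded string, that is, the source string.
4. The fourth field 104 always occupies 2 bytes and records the offset.
Following these rules, compression-encoding the string "AAAABCDAAAA" with the Lz4 compression algorithm yields the code stream unit shown in FIG. 2, which is a schematic structural diagram of a code stream unit obtained by compression-encoding source data with Lz4. As shown in FIG. 2, 201 occupies one byte: its high 4 bits represent the literal length 7 (Literal length = 7) and its low 4 bits represent the match length 4 (Match length = 4); 202 stores the unencoded string, that is, the source string "AAAABCD"; 203 always occupies two bytes and stores the offset. In practice, a code stream unit represents all data as bits; FIG. 2 shows the bits of each field in hexadecimal so that the fields are clearly visible.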
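The rules above can be sketched as a small encoder for one code stream unit. This is an illustrative sketch that follows the rules exactly as stated in this description, with two assumptions that are not in the text: the 2-byte offset is written little-endian (as the public LZ4 block format does), and the function names are hypothetical. Note that the real LZ4 format stores the match length minus its minimum of 4 in the token, whereas the description here writes the match length directly.

```python
def length_extension(extra: int) -> bytes:
    """Prefix coding for the part of a length above 15 (rule 2).

    One 0xFF byte per full 255, then the remainder (which may be 0).
    """
    out = bytearray()
    while extra >= 255:
        out.append(0xFF)
        extra -= 255
    out.append(extra)
    return bytes(out)


def encode_unit(literals: bytes, offset: int, match_length: int) -> bytes:
    """Build one unit: token, [literal ext], literals, offset, [match ext]."""
    literal_length = len(literals)
    # Token: high 4 bits hold the literal length, low 4 bits the match length,
    # each capped at 15 when an extension field follows (rules 1.1 and 1.2).
    token = (min(literal_length, 15) << 4) | min(match_length, 15)
    unit = bytearray([token])
    if literal_length >= 15:
        unit += length_extension(literal_length - 15)   # second field
    unit += literals                                    # third field
    unit += offset.to_bytes(2, "little")                # fourth field (byte order assumed)
    if match_length >= 15:
        unit += length_extension(match_length - 15)     # fifth field
    return bytes(unit)
```

For the example in the text, `encode_unit(b"AAAABCD", 7, 4)` produces a 10-byte unit whose first byte is 0x74, followed by the seven literal bytes and the two offset bytes.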
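Byte output model of the Lz4 compression algorithm
The match length is written as Match length, the literal length as Literals length, and the offset as Offset. From the rules for writing the offset, the match length, and the literal length, the number of output bytes can be calculated as follows:
1) When Literals length < 15 and Match length < 15:
Output bytes = 1 + 0 + Literals length + 2 + 0.
2) When Literals length >= 15 and Match length < 15:
Output bytes = 1 + (⌊(Literals length - 15)/255⌋ + 1) + Literals length + 2 + 0.
3) When Literals length < 15 and Match length >= 15:
Output bytes = 1 + 0 + Literals length + 2 + (⌊(Match length - 15)/255⌋ + 1).
4) When Literals length >= 15 and Match length >= 15:
Output bytes = 1 + (⌊(Literals length - 15)/255⌋ + 1) + Literals length + 2 + (⌊(Match length - 15)/255⌋ + 1), where "⌊x⌋" denotes the largest integer not exceeding x.
With the byte output model of the Lz4 compression algorithm, the storage space required to encode the source data with Lz4 can be calculated accurately and quickly. In other words, before the code stream unit for a given offset, match length, and literal length is generated, the byte output model can determine how much storage space that unit will occupy.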
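The four cases of the Lz4 byte output model collapse into one expression once the extension-byte count is factored out. The sketch below is a direct transcription of the formulas above; the function names `ext_bytes` and `lz4_unit_bytes` are hypothetical.

```python
def ext_bytes(length: int, limit: int) -> int:
    """Extra length bytes needed when `length` does not fit in the token field.

    Zero when length < limit, otherwise floor((length - limit) / 255) + 1.
    """
    return 0 if length < limit else (length - limit) // 255 + 1


def lz4_unit_bytes(literal_length: int, match_length: int) -> int:
    """Output bytes of one Lz4 code stream unit, per the byte output model."""
    return (
        1                                  # token
        + ext_bytes(literal_length, 15)    # literal length extension
        + literal_length                   # literals copied verbatim
        + 2                                # fixed 2-byte offset
        + ext_bytes(match_length, 15)      # match length extension
    )
```

For the running example (literal length 7, match length 4), the model gives 1 + 0 + 7 + 2 + 0 = 10 bytes.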
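(II) data-shrinker compression algorithm
The data-shrinker compression algorithm is another open-source lossless compression algorithm similar to Lz4. Its code stream unit design is largely the same as that of Lz4, with the following differences:
(1) In the first field, the "Token" control byte that contains part or all of the match length and the literal length, the 8 bits change from "4+4" to "3+1+4". The 1 bit that is split off indicates whether the fourth field, which records the offset, occupies 1 byte or 2 bytes.
(2) The fourth field, which records the offset, is no longer fixed at 2 bytes. When the offset is less than 256 (Offset < 256), 1 byte is used to represent it; otherwise 2 bytes are used.
(3) The prefix coding of the literal length changes. If the literal length is less than 7 (Literal length < 7), it is written into the high 3 bits of the first field and no further bytes are used to represent it; otherwise, the part of the literal length that exceeds 7 bytes is represented in the second field with prefix coding.
(4) Compared with the code stream writing process of Lz4, the match length and the source string are written in the opposite order; that is, the third field and the fifth field of the code stream unit are swapped.
Accordingly, compression-encoding the example string "AAAABCDAAAA" with the data-shrinker compression algorithm yields the code stream unit shown in FIG. 3. As shown in FIG. 3, 301 occupies one byte: its high 3 bits represent the literal length 7 (Literal length), its low 4 bits represent the match length 4 (Match length), and its fourth bit is 0, indicating that the offset occupies one byte; 302 holds the part of the literal length (Literal length) that exceeds 7 bytes; 303 occupies 1 byte and stores the offset; 304 stores the unencoded string, that is, the source string "AAAABCD". In practice, the code stream unit represents all data as bits; FIG. 3 shows the bits of each field in hexadecimal so that the fields are clearly visible.
In terms of encoding results, Lz4 and data-shrinker produce code streams of comparable length. Relative to Lz4, data-shrinker has the following disadvantage and advantage:
1. One bit of the first field, the "Token" control byte, is reassigned, which weakens data-shrinker's ability to encode the literal length. In the example the literal length is 7 (Literal length = 7); Lz4 needs only 1 byte to represent it, whereas data-shrinker needs 2 bytes.
2. For recording the offset, data-shrinker's encoding ability is stronger. In the example, encoding with data-shrinker avoids the byte wasted by Lz4: comparing FIG. 2 and FIG. 3, data-shrinker uses 1 byte to record the offset and Lz4 uses 2 bytes.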
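Byte output model of the data-shrinker compression algorithm
Similarly, the byte output model of the data-shrinker compression algorithm can be obtained. The calculation rules are as follows:
1) When Literals length < 7, Match length < 15, and Offset < 256:
Output bytes = 1 + 0 + 1 + 0 + Literals length.
2) When Literals length < 7, Match length < 15, and Offset >= 256:
Output bytes = 1 + 0 + 2 + 0 + Literals length.
3) When Literals length >= 7, Match length < 15, and Offset < 256:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 1 + 0 + Literals length.
4) When Literals length >= 7, Match length < 15, and Offset >= 256:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 2 + 0 + Literals length.
5) When Literals length < 7, Match length >= 15, and Offset < 256:
Output bytes = 1 + 0 + 1 + (⌊(Match length - 15)/255⌋ + 1) + Literals length.
6) When Literals length < 7, Match length >= 15, and Offset >= 256:
Output bytes = 1 + 0 + 2 + (⌊(Match length - 15)/255⌋ + 1) + Literals length.
7) When Literals length >= 7, Match length >= 15, and Offset < 256:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 1 + (⌊(Match length - 15)/255⌋ + 1) + Literals length.
8) When Literals length >= 7, Match length >= 15, and Offset >= 256:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 2 + (⌊(Match length - 15)/255⌋ + 1) + Literals length.
With the byte output model of the data-shrinker compression algorithm, the storage space required to encode the source data with data-shrinker can be calculated accurately and quickly. In other words, before the code stream unit for a given offset, match length, and literal length is generated, the model can determine how much storage space the unit generated by data-shrinker will occupy.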
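The eight cases above differ only in three independent terms (literal extension, offset width, match extension), so they can be written as one function. This is a direct transcription of the formulas in this description; the function names are hypothetical.

```python
def ext_bytes(length: int, limit: int) -> int:
    """Zero when length < limit, otherwise floor((length - limit) / 255) + 1."""
    return 0 if length < limit else (length - limit) // 255 + 1


def shrinker_unit_bytes(literal_length: int, match_length: int, offset: int) -> int:
    """Output bytes of one data-shrinker code stream unit, per the byte output model."""
    offset_bytes = 1 if offset < 256 else 2     # offset field is 1 or 2 bytes
    return (
        1                                  # token ("3+1+4" bits)
        + ext_bytes(literal_length, 7)     # literal length extension (3-bit field)
        + offset_bytes
        + ext_bytes(match_length, 15)      # match length extension (4-bit field)
        + literal_length                   # literals copied verbatim
    )
```

For the running example (literal length 7, match length 4, offset 7), the model gives 1 + 1 + 1 + 0 + 7 = 10 bytes, the same total as Lz4: the byte saved on the offset is spent on the literal length extension.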
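(III) Lizard compression algorithm
The Lizard compression algorithm (formerly called "Lz5") is another open-source lossless compression algorithm similar to Lz4. Compared with Lz4, the design of its first field, the "Token" control byte, has more of an "entropy coding" character. Entropy coding is coding that loses no information, in accordance with the entropy principle. Lizard differs from Lz4 mainly in the following ways:
(1) Of the 8 bits in the "Token" control byte, the first 1 to 2 bits are an entropy-coding prefix marker, which takes one of the following 3 forms:
The first form is 1OO LL MMM:
If the offset is between 0 and 1023, the highest bit is 1. The second and third bits, together with the one byte that follows to record the offset, form 10 bits that jointly record the offset. The 2 bits at the fourth and fifth positions hold the prefix coding of the literal length (Literal length). The 3 bits at the sixth to eighth positions hold the prefix coding of the match length (Match length). The "Token" control byte occupies one byte, that is, 8 bits.
The second form is 00LLL MMM:
If the offset is between 1024 and 65535, the highest 2 bits are 00 and the offset is recorded in two bytes during encoding. The 3 bits at the third to fifth positions hold the prefix coding of the literal length (Literal length). The 3 bits at the sixth to eighth positions hold the prefix coding of the match length (Match length).
The third form is 01LLL MMM:
If the offset is between 65536 and 16777215, the highest 2 bits are 01 and the offset is recorded in 3 bytes during encoding. The 3 bits at the third to fifth positions hold the prefix coding of the literal length (Literal length). The 3 bits at the sixth to eighth positions hold the prefix coding of the match length (Match length).
(2) The offset is no longer fixed at 2 bytes; its width, 1 to 3 bytes, is tied to the marker in the first field, the "Token" control byte.
(3) The prefix coding of the literal length changes.
When the "Token" control byte has the form 1OO LL MMM, if the literal length is less than 3 (Literal length < 3), it is written into the fourth and fifth bits of the "Token" control byte and no further bytes are used to represent it; otherwise, the part of the literal length that exceeds 3 bytes is represented in the "Literal length+" part with prefix coding. "Literal length+" corresponds to the second field in FIG. 1.
When the "Token" control byte does not have the form 1OO LL MMM, if the literal length is less than 7 (Literal length < 7), it is written into the third to fifth bits of the "Token" control byte and no further bytes are used to represent it; otherwise, the part of the literal length that exceeds 7 bytes is represented in the "Literal length+" part with prefix coding.
(4) The prefix coding of the match length changes.
If the match length is less than 7 (Match length < 7), it is written into the sixth to eighth bits of the "Token" control byte and no further bytes are used to represent it; otherwise, the part of the match length that exceeds 7 bytes is represented in the "Match length+" part with prefix coding. "Match length+" corresponds to the fifth field in FIG. 1.
Accordingly, compression-encoding the example string "AAAABCDAAAA" with the Lizard compression algorithm yields the code stream unit shown in FIG. 4. As shown in FIG. 4, the first field 401 is "0x9C", whose 8 bits are "10011100". Its second and third bits, together with the one byte that records the offset (404), form 10 bits that jointly record the offset; the 2 bits at its fourth and fifth positions hold the prefix coding of the literal length; the 3 bits at its sixth to eighth positions hold the prefix coding of the match length. The second field 402 holds the part of the literal length that exceeds 3 bytes; it is "0x04", whose 8 bits are "00000100". 403 is the unencoded string, that is, the source string. 404 occupies one byte and records the offset. In FIG. 4, the 10 bits formed by the second and third bits of the first field 401 and the 8 bits of 404 represent the offset 7; the 3 bits at the sixth to eighth positions of the first field 401, "100", represent the match length 4; and the sum of the 2 bits at the fourth and fifth positions of the first field 401, "11", and the 8 bits of the second field 402, "00000100", represents the literal length 7.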
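The field layout of the FIG. 4 example can be checked with a small parser for the 1OO LL MMM token form. This is a sketch under stated assumptions, not Lizard's actual decoder: it assumes the fields appear in the order of their reference numerals (token, literal length extension, literals, offset byte), that the two OO bits are the high bits of the 10-bit offset, and that a single extension byte suffices; the function name is hypothetical.

```python
def parse_short_offset_unit(unit: bytes):
    """Parse one unit whose token has the form 1OO LL MMM.

    Returns (offset, literals, match_length).
    """
    token = unit[0]
    if not token & 0x80:
        raise ValueError("only the 1OO LL MMM token form is handled")
    offset_high = (token >> 5) & 0b11      # OO: high 2 bits of the 10-bit offset
    literal_bits = (token >> 3) & 0b11     # LL: literal length, 3 means "see extension"
    match_length = token & 0b111           # MMM: match length, 7 means "see extension"
    if match_length == 0b111:
        raise ValueError("match length extension is not handled in this sketch")
    pos = 1
    literal_length = literal_bits
    if literal_bits == 0b11:
        literal_length += unit[pos]        # single extension byte (assumption)
        pos += 1
    literals = unit[pos:pos + literal_length]
    pos += literal_length
    offset = (offset_high << 8) | unit[pos]
    return offset, literals, match_length
```

Feeding it the bytes described for FIG. 4 (0x9C, 0x04, "AAAABCD", 0x07) recovers offset 7, the literals "AAAABCD", and match length 4.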
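In terms of encoding results, Lizard produces a code stream of length comparable to those of Lz4 and data-shrinker. However, Lizard's "Token" control byte already has "entropy coding" characteristics, and its code stream is more complicated to write than those of Lz4 and data-shrinker. Compared with Lz4 and data-shrinker, Lizard encodes more slowly but achieves a higher compression ratio. In the typical case, short matches are frequent and the offset, match length, and literal length are all small, so Lizard needs less storage space to encode the source data.
Byte output model of the Lizard compression algorithm
Similarly, the byte output model of the Lizard compression algorithm can be obtained. The calculation rules are as follows:
1) When Offset < 1024, Literals length < 3, and Match length < 7:
Output bytes = 1 + 0 + 1 + 0 + Literals length.
2) When Offset < 1024, Literals length >= 3, and Match length < 7:
Output bytes = 1 + (⌊(Literals length - 3)/255⌋ + 1) + 1 + 0 + Literals length.
3) When Offset < 1024, Literals length < 3, and Match length >= 7:
Output bytes = 1 + 0 + 1 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
4) When Offset < 1024, Literals length >= 3, and Match length >= 7:
Output bytes = 1 + (⌊(Literals length - 3)/255⌋ + 1) + 1 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
5) When 1024 <= Offset < 65536, Literals length < 7, and Match length < 7:
Output bytes = 1 + 0 + 2 + 0 + Literals length.
6) When 1024 <= Offset < 65536, Literals length >= 7, and Match length < 7:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 2 + 0 + Literals length.
7) When 1024 <= Offset < 65536, Literals length < 7, and Match length >= 7:
Output bytes = 1 + 0 + 2 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
8) When 1024 <= Offset < 65536, Literals length >= 7, and Match length >= 7:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 2 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
9) When Offset >= 65536, Literals length < 7, and Match length < 7:
Output bytes = 1 + 0 + 3 + 0 + Literals length.
10) When Offset >= 65536, Literals length >= 7, and Match length < 7:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 3 + 0 + Literals length.
11) When Offset >= 65536, Literals length < 7, and Match length >= 7:
Output bytes = 1 + 0 + 3 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
12) When Offset >= 65536, Literals length >= 7, and Match length >= 7:
Output bytes = 1 + (⌊(Literals length - 7)/255⌋ + 1) + 3 + (⌊(Match length - 7)/255⌋ + 1) + Literals length.
With the byte output model of the Lizard compression algorithm, the storage space required to encode the source data with Lizard can be calculated accurately and quickly. In other words, before the code stream unit for a given offset, match length, and literal length is generated, the model can calculate how much storage space the unit generated by Lizard will occupy. Using byte output models to calculate the storage space a compression algorithm needs to encode the description information is one inventive point of this application; conventional compression algorithms do not calculate the storage space required to encode the description information or the source data.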
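The twelve cases above reduce to a choice of offset width, which also fixes the literal length limit, plus the two extension terms. The sketch below transcribes the formulas in this description; the function names are hypothetical.

```python
def ext_bytes(length: int, limit: int) -> int:
    """Zero when length < limit, otherwise floor((length - limit) / 255) + 1."""
    return 0 if length < limit else (length - limit) // 255 + 1


def lizard_unit_bytes(literal_length: int, match_length: int, offset: int) -> int:
    """Output bytes of one Lizard code stream unit, per the byte output model."""
    if offset < 1024:
        offset_bytes, literal_limit = 1, 3   # token form 1OO LL MMM
    elif offset < 65536:
        offset_bytes, literal_limit = 2, 7   # token form 00LLL MMM
    else:
        offset_bytes, literal_limit = 3, 7   # token form 01LLL MMM
    return (
        1                                          # token
        + ext_bytes(literal_length, literal_limit)
        + offset_bytes
        + ext_bytes(match_length, 7)
        + literal_length                           # literals copied verbatim
    )
```

For the running example (literal length 7, match length 4, offset 7), the model gives 1 + 1 + 1 + 0 + 7 = 10 bytes, matching the four fields of FIG. 4.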
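The encoding rules, code stream unit structures, and byte output models of the conventional Lz4, data-shrinker, and Lizard compression algorithms have been described above. The compression ratios these algorithms achieve on different types of data are not fixed. Different compression algorithms may need different amounts of storage space to encode the same source data, and no single compression algorithm can guarantee a high compression ratio for every kind of source data. Before the source data is compression-encoded, the byte output models of the available compression algorithms can be used to calculate the storage space required to encode it. The encoding flow used by conventional compression algorithms is described below; as shown in FIG. 5, it includes:
501. Acquire the source data to be compressed.
502. Acquire the source string and semantic information of the source data.
The source string and semantic information may be acquired while traversing the source data. The semantic information is the encoding information corresponding to a compressed string, for example, the offset and match length in the Lz4 lossless compression algorithm.
503. Encode the source string and the semantic information to obtain a code stream unit.
504. Determine whether compression encoding is complete.
If yes, perform 505; if no, perform 502. Compression encoding is determined to be complete when both the semantic information and the source string have been compression-encoded.
505. Stop compression encoding.
In general, an Lz77-family compression algorithm calculates <offset, match length> pairs while traversing the source data, thereby producing the semantic information, and then encodes the semantic information according to its encoding rules to obtain code stream units, which together form the code stream. When the traversal of the source data finishes, the code stream is complete. As FIG. 5 shows, the current compression encoding approach encodes the source data with a preset compression algorithm, and the storage space required cannot be determined until encoding is complete. Because a single compression algorithm can hardly achieve a high compression ratio for every type of source data, encoding some types of source data requires a relatively large amount of storage space.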
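To address the low compression ratio of conventional codec methods, this application provides a codec method whose main principle is as follows: while traversing the source data, determine a source string and description information from the source data to be compressed, and store them; before encoding the source data, calculate the storage space required by each of at least two preset compression algorithms to encode the description information; and select the compression algorithm that requires less storage space to perform compression encoding from the stored source string and description information, obtaining compressed data. The storage space each compression algorithm needs to encode the source data can be calculated exactly before the source data is encoded. This application can therefore compare the storage space required by the algorithms and then, according to service requirements, choose an algorithm with a smaller storage footprint to compress the data, which improves the compression ratio, saves storage space, and safeguards the compression result.
An embodiment of the present invention provides a codec method which, as shown in FIG. 6, includes:
601. Acquire source data to be compressed, and determine a source string and description information according to the source data to be compressed; the source string is the string in the source data that is not compressed, and the description information describes the correspondence between the compressed strings and the source string.
The source data to be compressed may be text data, image data, audio data, and so on. A compressed string may be a string that is not occurring for the first time in the source data and whose length reaches a threshold; the threshold may be 3, 4, 5, 6, and so on. The source string may consist of the strings in the source data that occur for the first time and/or are shorter than the threshold. The compressed strings can be recovered from the description information and the source string. In the embodiments of the present invention, the description information replaces the repeated strings, that is, each compressed string is replaced by description information, which achieves compression encoding of the source data.
For example, if the source data to be compressed is "AAAABCDAAAA" and the threshold is 4, the source string is "AAAABCD", the compressed string is the trailing "AAAA", and the description information is (7,7,4). The first value represents the positional relationship between the compressed string and the source string, the second value represents the starting position of the compressed string in the source string, and the third value represents the length of the compressed string. Here, the compressed string follows the 7th character of the source string; its starting position in the source string lies 7 bytes before the start of the compressed string; and its length is 4.
As another example, if the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4, the source string is "AAAABCDEF", the compressed strings are, in order, the second occurrences of "AAAAB", "BCDA", and "EFBC", and their three pieces of description information are (7,7,5), (2,10,4), and (0,6,4). The first 7 characters of the source data occur for the first time, cannot be compression-encoded, and belong to the source string; the first occurrence of "EF" cannot be encoded either and also belongs to the source string; hence the source string is "AAAABCDEF". The second occurrence of "AAAAB" follows the 7th character of the source string, its starting position in the source string lies 7 bytes before its own start, and its length is 5. The second occurrence of "BCDA" follows the (7+2)th character of the source string, that is, it follows the "EF" in the source string; its starting position in the source string lies 10 bytes before its own start, and its length is 4. The second occurrence of "EFBC" follows the (7+2+0)th character of the source string, that is, no source string immediately precedes it; its starting position in the source string lies 6 bytes before its own start, and its length is 4. The first value of a piece of description information can be understood as the length of the source string that immediately precedes the compressed string. In this example, the source string adjacent to the second occurrence of "BCDA" is "EF", so the first value of its description information is 2, the length of "EF"; no source string is adjacent to the second occurrence of "EFBC", so the first value of its description information is 0.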
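The two worked examples above can be reproduced with a short sketch that derives the source string and the description information, and then rebuilds the source data from them. This is an illustration only: it uses a brute-force greedy longest-match search in place of the hash-based search the description uses, the function names are hypothetical, and the triples follow the interpretation given above (preceding literal count, distance back to the match, match length).

```python
def find_description(source: str, threshold: int = 4):
    """Return (source_string, descriptions) for `source`.

    Each description is (first, second, third): the number of uncompressed
    characters since the previous compressed string, the distance back to the
    earlier occurrence, and the length of the compressed string.
    """
    literals = []        # characters that stay uncompressed (the source string)
    descriptions = []
    pending = 0          # uncompressed characters since the last compressed string
    i = 0
    while i < len(source):
        best_len, best_start = 0, -1
        for j in range(i):                      # brute force over earlier positions
            k = 0
            while i + k < len(source) and source[j + k] == source[i + k]:
                k += 1
            if k > best_len:
                best_len, best_start = k, j
        if best_len >= threshold:
            descriptions.append((pending, i - best_start, best_len))
            pending = 0
            i += best_len
        else:
            literals.append(source[i])
            pending += 1
            i += 1
    return "".join(literals), descriptions


def rebuild(source_string: str, descriptions) -> str:
    """Inverse of find_description: recover the source data."""
    out = []
    pos = 0                                     # read position in the source string
    for literal_count, distance, length in descriptions:
        out.extend(source_string[pos:pos + literal_count])
        pos += literal_count
        start = len(out) - distance
        if distance <= 0 or start < 0:
            raise ValueError("description points outside the data rebuilt so far")
        for k in range(length):                 # copy one by one so overlaps work
            out.append(out[start + k])
    out.extend(source_string[pos:])             # trailing uncompressed characters
    return "".join(out)
```

Applied to "AAAABCDAAAABEFBCDAEFBC" with a threshold of 4, the search yields the source string "AAAABCDEF" and the triples (7,7,5), (2,10,4), (0,6,4) given in the text, and `rebuild` restores the original 22 characters.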
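Encoding the source data with a compression algorithm requires obtaining the source string and description information of the source data and then encoding them to obtain compressed data. In practice, the source string and description information may be obtained in many ways, and description information of many forms may be used to describe the correspondence between the compressed strings and the source string. This embodiment does not limit how the source string and description information are obtained, or the specific form of the description information.
An embodiment of the present invention provides a method for obtaining the source string and description information of source data. Obtaining the source string and description information of the source data includes:
obtaining, in their order of appearance in the source data, the strings that occur in the source data for the first time, to obtain the source string;
searching the source data, using a hash algorithm, for a string that matches a target string;
when a reference string that matches the target string is found, determining that the target string can be compression-encoded; and
obtaining a first value, a second value, and a third value to generate a target field, where the first value indicates the positional relationship between the target string and the source string, the second value indicates the start position of the target string in the source string, and the third value indicates the length of the target string.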
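In this embodiment, a hash algorithm may be used to search the source data for the source string and the compressible strings. Specifically, the hash algorithm computes hash values for strings in the source data, and comparing hash values determines which strings match, from which the source string and the target field are obtained. The following uses the hash search of the LZ4 compression algorithm as an example:
(1) Take 4 bytes from the source data.
For example, if the 4 bytes are "AAAA" and the ASCII code of A is 0x41, then "AAAA" is 0x41414141 = 1094795585. The number of bytes taken from the source data each time equals the threshold. In this embodiment the threshold is 4, that is, the minimum length of a compressed string is 4.
(2) Multiply by the golden-ratio prime.
1094795585 * 2654435761 = 2906064551808915185 = 0x28546A5018ECD6F1, where 2654435761 is the golden-ratio prime.
(3) Take the upper 13 bits of the lower 32 bits as the key value.
0x18ECD6F1 >> 19 = 0x31D = 797, where ">> 19" denotes a right shift by 19 bits. The hash value of "AAAA" is 797, that is, Hash("AAAA") = 797.
(4) Check whether a valid address already exists at position 797 of a table that stores data addresses. If it does, the coarse hash match succeeds, and the content at that valid address is then compared to check whether it is "AAAA". If it does not, the valid address of "AAAA" is stored at position 797 of the table.
(5) If the content at the valid address is "AAAA", generate the description information corresponding to the data in these 4 bytes.
In summary, each time 4 bytes are taken from the source data and multiplied by the 4-byte golden-ratio number (a prime near 0xFFFFFFFF * 0.618), and the upper 13 bits of the 32-bit result are output as the hash value. The hash algorithm thus finds the source string and the encodable strings quickly and improves encoding efficiency. A hash algorithm can be implemented in many ways, which this embodiment does not limit.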
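Steps (1) to (5) can be sketched as follows. This is a minimal illustration rather than the LZ4 implementation: the names `hash4` and `find_candidate` are hypothetical, the 4 bytes are read big-endian (an assumption that does not matter for "AAAA"), and the table slot is always overwritten with the newest position, which the text does not specify.

```python
from typing import Dict, Optional

GOLDEN_PRIME = 2654435761   # prime near 0xFFFFFFFF * 0.618, as in step (2)
HASH_BITS = 13              # step (3): keep the upper 13 bits of the low 32 bits


def hash4(chunk: bytes) -> int:
    """Hash 4 bytes to a 13-bit table index."""
    value = int.from_bytes(chunk, "big")
    low32 = (value * GOLDEN_PRIME) & 0xFFFFFFFF
    return low32 >> (32 - HASH_BITS)


def find_candidate(data: bytes, pos: int, table: Dict[int, int]) -> Optional[int]:
    """Return an earlier position holding the same 4 bytes as data[pos:pos+4].

    Returns None when the slot is empty or the coarse hash match turns out
    to be a collision. The slot is updated with `pos` either way.
    """
    chunk = data[pos:pos + 4]
    key = hash4(chunk)
    earlier = table.get(key)
    table[key] = pos
    if earlier is not None and data[earlier:earlier + 4] == chunk:
        return earlier
    return None


table: Dict[int, int] = {}
data = b"AAAABCDAAAAB"
print(hash4(b"AAAA"))                    # 797, matching step (3)
print(find_candidate(data, 0, table))    # None: slot 797 was empty
print(find_candidate(data, 7, table))    # 0: "AAAA" was seen at position 0
```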
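This embodiment uses a hash algorithm to search the source data for encodable strings and to generate the corresponding description information, with low time overhead.
602. Separately calculate the storage space that each of at least two compression algorithms needs to encode the description information.
The at least two compression algorithms may include the LZ4 compression algorithm, the data-shrinker compression algorithm, the Lizard compression algorithm, and so on. The encoding and decoding apparatus may be preset with the at least two compression algorithms and with a byte output model for each of them, for example, a byte output model of the LZ4 compression algorithm, a byte output model of the data-shrinker compression algorithm, and a byte output model of the Lizard compression algorithm. The encoding and decoding apparatus may be a mobile phone, a computer, a tablet, or another device capable of encoding and decoding. This embodiment does not limit the at least two algorithms. The byte output models are used to calculate the storage space that each of the at least two compression algorithms needs to encode the description information. Specifically, the values in the description information are substituted into the byte output model of each compression algorithm, and the calculation yields the storage space that the algorithm needs to encode the description information.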
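603. From the at least two compression algorithms, select one that needs less storage space to encode the description information as the target algorithm.
Here, a compression algorithm that needs less storage space may be any of the at least two compression algorithms other than the one that needs the most storage space to encode the description information. In other words, any compression algorithm except the one with the largest storage requirement may be selected as the target algorithm; for example, the algorithm with the smallest requirement may be selected. For instance, if among five compression algorithms (the first to the fifth) the first needs the most storage space to encode the description information, any algorithm other than the first may be selected as the target algorithm.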
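The selection rule can be sketched in a few lines. The function names and the byte counts below are made up for illustration; only the rule itself (exclude the largest, typically pick the smallest) comes from the text.

```python
def admissible(sizes: dict) -> list:
    """Every algorithm except the one with the largest estimated size."""
    largest = max(sizes, key=sizes.get)
    return [name for name in sizes if name != largest]


def choose_target(sizes: dict) -> str:
    """The common choice: the algorithm with the smallest estimated size."""
    return min(sizes, key=sizes.get)


# Hypothetical estimates, in bytes, for five preset algorithms.
sizes = {"first": 120, "second": 90, "third": 75, "fourth": 101, "fifth": 88}
print(admissible(sizes))      # all but "first"
print(choose_target(sizes))   # "third"
```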
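604. Compression-encode the source data using the target algorithm to obtain compressed data.
Compression-encoding the source data using the target algorithm may consist of encoding the source string and the description information in the encoding format of the target algorithm, to obtain the compressed data corresponding to the source data. In this embodiment, the source data needs to be traversed only once, regardless of which compression algorithm is used, to obtain and store the source string and the description information; once the target algorithm is determined, the source string and the description information are encoded in its encoding format. For given source data and a given threshold, there is exactly one source string and one set of description information. That is, different compression algorithms encode the same data from the same description information and source string, and each can produce its own compressed data from them.
By selecting, before the source data is compression-encoded, a target algorithm that needs less storage space to encode the source data, and then encoding with that algorithm, this embodiment can noticeably improve the compression ratio without significantly increasing encoding time. Only the storage space that each of the at least two compression algorithms needs to encode the description information has to be calculated; no algorithm other than the target algorithm performs an actual encoding pass, so the encoding overhead is small.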
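In an optional implementation, the compressed data contains an indication field, and the indication field indicates the target algorithm.
The indication field may occupy at least one bit to indicate the target algorithm. Different binary sequences of the indication field indicate different compression algorithms. For example, if the encoding and decoding apparatus is preset with 4 compression algorithms, any of which may be used for compression encoding, the indication field occupies two bits: 00 indicates the first compression algorithm, 01 the second, 10 the third, and 11 the fourth. If the target algorithm is the fourth compression algorithm, the indication field is 11. As another example, if the apparatus is preset with 8 compression algorithms, the indication field occupies 3 bits: 000 indicates the first compression algorithm, 001 the second, 010 the third, 011 the fourth, 100 the fifth, 101 the sixth, 110 the seventh, and 111 the eighth. If the target algorithm is the fourth compression algorithm, the indication field is 011.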
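The width and value of the indication field follow directly from the number of preset algorithms. A minimal sketch, assuming zero-based algorithm indices and hypothetical function names:

```python
def indicator_width(algorithm_count: int) -> int:
    """Smallest number of bits that can distinguish all preset algorithms."""
    return max(1, (algorithm_count - 1).bit_length())


def indicator_bits(index: int, algorithm_count: int) -> str:
    """Bit string for the algorithm at zero-based `index`."""
    width = indicator_width(algorithm_count)
    return format(index, f"0{width}b")


print(indicator_bits(3, 4))   # fourth of 4 algorithms -> "11"
print(indicator_bits(3, 8))   # fourth of 8 algorithms -> "011"
```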
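This embodiment uses the indication field to indicate the compression algorithm used to encode the source data, so that when the compressed data is decompressed, the decompression algorithm corresponding to that compression algorithm is used, which improves decompression efficiency.
In an optional implementation, the description information contains a target field. The target field describes the correspondence between a target string and the source string, and the target string is one of the compressed strings. The target field contains a first value, a second value, and a third value: the first value indicates the positional relationship between the target string and the source string, the second value indicates the start position of the target string in the source string, and the third value indicates the length of the target string.
The target string can be recovered from the target field and the source string, so the target field can replace the target string. The first value indicates the positional relationship between the target string and the source string and determines where the target string sits in the source data. The second and third values determine the target string itself: starting from the position indicated by the second value, take as many characters from the source string as the third value specifies. For example, if the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4, the source string is "AAAABCDEF", the target string is the second occurrence of "AAAAB" in the source data, and the target field is (7,7,5). The target string follows the 7th character of the source string; moving back 7 bytes from the start of the target string gives its start position in the source string; and its length is 5. The first value can also be understood as the length of the source-string segment that immediately precedes the target string, that is, the length of the unencoded string immediately before it. In practice, target fields of other forms may be used to describe the correspondence between the target string and the source string, which this embodiment does not limit.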
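The claim that the target string is recoverable can be checked by rebuilding the source data from the source string and the target fields. The sketch assumes, as in the worked examples, that the second value is a backward distance measured from the start of the target string in the reconstructed data; the function name `rebuild` is hypothetical.

```python
def rebuild(source_string: str, fields) -> str:
    """Reconstruct the source data from the source string and target fields."""
    out = []
    cursor = 0  # next unread character of the source string
    for literal_len, offset, match_len in fields:
        # First value: copy that many uncompressed characters.
        out.extend(source_string[cursor:cursor + literal_len])
        cursor += literal_len
        # Second and third values: copy `match_len` characters starting
        # `offset` positions back. Copying one at a time keeps overlapping
        # copies correct.
        start = len(out) - offset
        for i in range(match_len):
            out.append(out[start + i])
    out.extend(source_string[cursor:])  # any literals after the last field
    return "".join(out)


print(rebuild("AAAABCDEF", [(7, 7, 5), (2, 10, 4), (0, 6, 4)]))
```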
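In this embodiment, the target field describes the correspondence between the target string and the source string, so that the target string can be determined accurately and quickly from the target field and the source string, which makes encoding efficient.
The source string and the description information may be stored in either of the following ways:
Mode 1: store the source string and the description information separately.
For example, if the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4, the source string is "AAAABCDEF", the compressed strings are, in order, the second occurrences of "AAAAB", "BCDA", and "EFBC" in the source data, and the three corresponding items of description information are (7,7,5), (2,10,4), and (0,6,4). The source string "AAAABCDEF" and the three items of description information are stored separately, and the description information is stored in the order (7,7,5), (2,10,4), (0,6,4).
Optionally, the items of description information may be stored in the order in which they are generated. In this mode the source string and the description information are stored in different storage spaces.
In this mode, one implementation of compression-encoding the source data is as follows: obtain the target field from the description information; obtain a first string from the source string, where the first string is the string in the source data that is adjacent to and precedes the target string; and, using the target algorithm, generate the compressed data of a first data segment from the first value, the second value, the third value, and the first string, where the first data segment contains the first string and the target string.
Obtaining the first string from the source string may consist of extracting, starting from the first character of the source string that has not yet been extracted, as many characters as the first value specifies. Assume the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4, so that the source string is "AAAABCDEF" and the description information is, in order, (7,7,5), (2,10,4), (0,6,4). Extracting 7 characters from the first unextracted character of the source string gives "AAAABCD"; encoding "AAAABCD" and (7,7,5) gives the compressed data of "AAAABCDAAAAB". Extracting 2 characters from the first unextracted character gives "EF"; encoding "EF" and (2,10,4) gives the compressed data of "EFBCDA". Extracting 0 characters from the first unextracted character gives no string; encoding (0,6,4) gives the compressed data of "EFBC". FIG. 7 is a schematic structural diagram of the bitstream unit obtained by encoding the source string "AAAABCD" and the description information (7,7,5) with the LZ4 compression algorithm, corresponding to the compressed data of "AAAABCDAAAAB"; 701 contains the first and third values of the description information, 702 is the source string, and 703 is the second value of the description information. FIG. 8 is a schematic structural diagram of the bitstream unit obtained by encoding the source string "EF" and the description information (2,10,4) with the LZ4 compression algorithm, corresponding to the compressed data of "EFBCDA"; 801 contains the first and third values, 802 is the source string, and 803 is the second value. FIG. 9 is a schematic structural diagram of the bitstream unit obtained by encoding the description information (0,6,4) with the LZ4 compression algorithm, corresponding to the compressed data of "EFBC"; 901 contains the first and third values, and 902 is the second value. In this embodiment, the stored items of description information may be encoded one after another to obtain the compressed data of the source data.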
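In this mode, the compressed data of the first data segment is generated quickly from the target field and the source string; the implementation is simple and encoding is efficient.
Mode 2: store target information. The target information contains the target field and a second string, where the second string is the string in the source data that is adjacent to and precedes the target string.
Storing the target information may consist of merging the target field and the second string into the target information and storing the result. If the target field is (7,7,5) and the second string is "AAAABCD", the target information is ("AAAABCD",7,7,5). If the target field is (2,10,4) and the second string is "EF", the target information is ("EF",2,10,4). In this mode the source string and the description information are stored as one piece of data in the same storage space.
In this mode, one implementation of compression-encoding the source data is as follows: obtain the target information; and, using the target algorithm, generate the compressed data of a second data segment from the first value, the second value, the third value, and the second string, where the second data segment contains the second string and the target string.
Assume the source data to be compressed is "AAAABCDAAAABEFBCDAEFBC" and the threshold is 4, so that the source string is "AAAABCDEF" and the description information is, in order, (7,7,5), (2,10,4), (0,6,4). The stored information then includes ("AAAABCD",7,7,5), ("EF",2,10,4), and ("",0,6,4), where ("",0,6,4) indicates that "EFBC" has no adjacent source string, that is, all of "EFBC" is to be compressed. If the target algorithm is the LZ4 compression algorithm and the target information is ("AAAABCD",7,7,5), encoding that target information with the target algorithm yields the bitstream unit shown in FIG. 7, which corresponds to the compressed data of "AAAABCDAAAAB".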
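Mode 2 amounts to storing each literal run next to its field. A minimal sketch of that record layout; the `TargetInfo` type and its attribute names are hypothetical:

```python
from typing import List, NamedTuple


class TargetInfo(NamedTuple):
    second_string: str   # literals that immediately precede the target string
    literal_len: int     # first value
    offset: int          # second value
    match_len: int       # third value


def to_target_info(source_string: str, fields) -> List[TargetInfo]:
    """Merge each target field with its preceding literals into one record."""
    records, cursor = [], 0
    for literal_len, offset, match_len in fields:
        second = source_string[cursor:cursor + literal_len]
        cursor += literal_len
        records.append(TargetInfo(second, literal_len, offset, match_len))
    return records


for record in to_target_info("AAAABCDEF", [(7, 7, 5), (2, 10, 4), (0, 6, 4)]):
    print(tuple(record))
```

Compared with Mode 1, the encoder no longer needs a cursor into a separate source-string buffer, at the cost of storing the literals inside each record.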
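In this mode, the compressed data of the second data segment is generated quickly from the target field; the implementation is simple and saves encoding time.
In this embodiment, compressed data may be decompressed as follows: parse the compressed data to obtain the target algorithm indicated by the indication field; and decompress the compressed data using the decompression algorithm corresponding to the target algorithm, to obtain the source data.
Parsing the compressed data to obtain the target algorithm indicated by the indication field may consist of parsing the indication field and determining the target algorithm from it. The encoding and decoding apparatus is preset with the correspondence between indication fields and compression algorithms. After parsing out the indication field, the apparatus can determine, from this correspondence, the compression algorithm used to encode the source data, that is, the target algorithm.
By parsing the indication field of the compressed data to determine the decompression algorithm needed, this embodiment completes decompression accurately and quickly.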
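Decoder-side dispatch on the indication field can be sketched with Python standard-library codecs standing in for the preset algorithms. Everything specific here is an assumption for illustration: the text does not fix the field's position or size, so the sketch spends one leading byte on it, and zlib, bz2, and lzma merely substitute for LZ4, data-shrinker, and Lizard.

```python
import bz2
import lzma
import zlib

# Hypothetical preset table: indication value -> (compress, decompress).
CODECS = {
    0: (zlib.compress, zlib.decompress),
    1: (bz2.compress, bz2.decompress),
    2: (lzma.compress, lzma.decompress),
}


def pack(indicator: int, source: bytes) -> bytes:
    """Prefix the payload with the indication field."""
    compress, _ = CODECS[indicator]
    return bytes([indicator]) + compress(source)


def unpack(blob: bytes) -> bytes:
    """Read the indication field, then run the matching decompressor."""
    _, decompress = CODECS[blob[0]]
    return decompress(blob[1:])


source = b"AAAABCDAAAABEFBCDAEFBC"
for indicator in CODECS:
    assert unpack(pack(indicator, source)) == source
print("round trip ok")
```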
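An embodiment of the present invention provides a specific example of the encoding and decoding method. As shown in FIG. 10, the method includes:
1001. While traversing the source data, obtain the source string and description information of the source data.
The source data corresponds to at least one item of description information. While the source data is traversed, the items of description information corresponding to the source data may be obtained one after another. The specific implementation is the same as in FIG. 6.
1002. Store the obtained source string and description information.
While the source data is traversed, the obtained description information and source string may be stored in the order in which they are obtained.
1003. Calculate the storage space that each of at least two compression algorithms needs to encode the description information.
The specific implementation is the same as in FIG. 6.
1004. Determine whether traversal of the source data is complete.
If so, perform 1005; if not, perform 1001.
1005. From the at least two compression algorithms, select one that needs less storage space to encode the description information as the target algorithm.
The specific implementation is the same as step 603 in FIG. 6.
1006. Compression-encode the source data using the target algorithm to obtain compressed data.
The specific implementation is the same as step 604 in FIG. 6. Comparing FIG. 5 with FIG. 10 shows that, in this embodiment, the storage space that each of at least two compression algorithms needs to encode the description information is calculated before the source data is compression-encoded, and a compression algorithm that needs less storage space to encode the description information is selected as the target algorithm; after the target algorithm is selected, it is used to encode the source data and obtain the compressed data.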
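The loop of steps 1001 to 1005 can be sketched as a single pass that updates one running total per algorithm. The two byte output models are simplified, hypothetical stand-ins (lengths are assumed to fit in the token byte), and the function name is illustrative.

```python
def select_during_traversal(fields, models):
    """Accumulate per-algorithm size estimates while items arrive."""
    stored = []
    totals = {name: 0 for name in models}
    for field in fields:                       # 1001: obtain the next item
        stored.append(field)                   # 1002: store it
        for name, model in models.items():     # 1003: update each estimate
            totals[name] += model(*field)
    # 1004 is the loop condition; 1005 picks the smallest total.
    target = min(totals, key=totals.get)
    return target, totals, stored


# Hypothetical models: token byte + literals + offset bytes.
MODELS = {
    "two_byte_offset": lambda lit, off, ml: 1 + lit + 2,
    "short_offset": lambda lit, off, ml: 1 + lit + (1 if off < 256 else 3),
}

target, totals, stored = select_during_traversal(
    [(7, 7, 5), (2, 10, 4), (0, 6, 4)], MODELS)
print(target, totals)
```

Because only counters are updated inside the loop, no encoder other than the chosen one ever runs, which is the source of the low overhead claimed for the method.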
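By selecting, before the source data is compression-encoded, the compression algorithm that needs less storage space to encode the source data as the target algorithm, and then encoding with that algorithm, this embodiment can noticeably improve the compression ratio without significantly increasing encoding time.
FIG. 11 is a functional block diagram of an encoding and decoding apparatus according to an embodiment of the present invention. The functional blocks of the apparatus may implement the solution of the present invention in hardware, software, or a combination of both. A person skilled in the art will understand that the functional blocks described in FIG. 11 may be combined or split into sub-blocks to implement the solution. The foregoing description therefore supports any possible combination, separation, or further definition of the functional modules described below.
As shown in FIG. 11, the encoding and decoding apparatus may include:
an obtaining unit 1101, configured to obtain source data to be compressed;
a determining unit 1102, configured to determine a source string and description information from the source data to be compressed, where the source string is the strings in the source data that are not compressed, and the description information describes the correspondence between the compressed strings and the source string;
a calculating unit 1103, configured to separately calculate the storage space that each of at least two compression algorithms needs to encode the description information;
a selecting unit 1104, configured to select, from the at least two compression algorithms, one that needs less storage space to encode the description information as the target algorithm; and
an encoding unit 1105, configured to compression-encode the source data using the target algorithm to obtain compressed data.
By selecting, before the source data is compression-encoded, a target algorithm that needs less storage space to encode the source data, and then encoding with that algorithm, this embodiment can noticeably improve the compression ratio and reduce the storage space occupied without significantly increasing encoding time.
In an optional implementation, the compressed data contains an indication field, and the indication field indicates the target algorithm.
This embodiment uses the indication field to indicate the compression algorithm used to encode the source data, so that when the compressed data is decompressed, the decompression algorithm corresponding to that compression algorithm is used, which improves decompression efficiency.
In an optional implementation, the description information contains a target field. The target field describes the correspondence between a target string and the source string, and the target string is one of the compressed strings. The target field contains a first value, a second value, and a third value: the first value indicates the positional relationship between the target string and the source string, the second value indicates the start position of the target string in the source string, and the third value indicates the length of the target string.
In this embodiment, the target field describes the correspondence between the target string and the source string, so that the target string can be determined accurately from the target field and the source string, which makes encoding efficient.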
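In an optional implementation, the encoding and decoding apparatus further includes:
a first storage unit 1106, configured to store the source string and the description information separately.
The encoding unit 1105 is specifically configured to: obtain the target field from the description information; obtain a first string from the source string, where the first string is the string in the source data that is adjacent to and precedes the target string; and, using the target algorithm, generate the compressed data of a first data segment from the first value, the second value, the third value, and the first string, where the first data segment contains the first string and the target string.
In this embodiment, the compressed data of the first data segment is generated quickly from the target field and the source string; the implementation is simple and encoding is efficient.
In an optional implementation, the encoding and decoding apparatus further includes:
a second storage unit 1107, configured to store target information, where the target information contains the target field and a second string, and the second string is the string in the source data that is adjacent to and precedes the target string.
The encoding unit 1105 is specifically configured to: obtain the target information; and, using the target algorithm, generate the compressed data of a second data segment from the first value, the second value, the third value, and the second string, where the second data segment contains the second string and the target string.
In this embodiment, the compressed data of the second data segment is generated quickly from the target field; the implementation is simple and saves encoding time.
In an optional implementation, the encoding and decoding apparatus further includes:
a parsing unit 1108, configured to parse the compressed data to obtain the target algorithm indicated by the indication field; and
a decoding unit 1109, configured to decompress the compressed data using the decompression algorithm corresponding to the target algorithm, to obtain the source data.
By parsing the indication field of the compressed data to determine the decompression algorithm needed, this embodiment completes decompression accurately and quickly.
In an optional implementation, the obtaining unit 1101 is specifically configured to: obtain, in their order of appearance in the source data, the strings that occur in the source data for the first time, to obtain the source string; search the source data, using a hash algorithm, for a string that matches the target string; when a reference string that matches the target string is found, determine that the target string can be compression-encoded; and generate the target field.
This embodiment uses a hash algorithm to search the source data for encodable strings and to generate the corresponding description information, with low time overhead.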
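It should be understood that the encoding and decoding apparatus in this embodiment may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the encoding and decoding method shown in FIG. 6 is implemented in software, the apparatus and its modules may also be software modules.
The encoding and decoding apparatus according to this embodiment may correspond to the entity that performs the method described in the embodiments of the present invention, and the foregoing and other operations and/or functions of its units serve to implement the corresponding procedures of the method in FIG. 6. For brevity, details are not repeated here.
By selecting, before the source data is compression-encoded, a target algorithm that needs less storage space to encode the source data, and then encoding with that algorithm, the encoding and decoding apparatus in this embodiment can noticeably improve the compression ratio and reduce the storage space occupied without significantly increasing encoding time.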
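FIG. 12 is a schematic block diagram of an encoding and decoding device according to another embodiment of the present invention. As shown in FIG. 12, the device in this embodiment may include one or more processors 1201, one or more input devices 1202, and a memory 1203. The processor 1201, the input device 1202, and the memory 1203 are connected by a bus 1204. The memory 1203 stores a computer program that includes program instructions, and the processor 1201 executes the program instructions stored in the memory 1203. The input device 1202 is used to input a compression instruction. The processor 1201 is configured to invoke the program instructions to: obtain source data to be compressed, and determine a source string and description information from the source data to be compressed, where the source string is the strings in the source data that are not compressed, and the description information describes the correspondence between the compressed strings and the source string; separately calculate the storage space that each of at least two compression algorithms needs to encode the description information; select, from the at least two compression algorithms, one that needs less storage space to encode the description information as the target algorithm; and compression-encode the source data using the target algorithm to obtain compressed data.
It should be understood that, in this embodiment, the processor 1201 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The processor 1201 may implement the functions of the obtaining unit 1101, the determining unit 1102, the calculating unit 1103, the selecting unit 1104, the encoding unit 1105, the parsing unit 1108, and the decoding unit 1109 shown in FIG. 11.
The memory 1203 includes but is not limited to a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and may be used to store related instructions and data. The memory 1203 may implement the functions of the first storage unit 1106 and the second storage unit 1107 shown in FIG. 11.
In a specific implementation, the processor 1201, the input device 1202, and the memory 1203 described in this embodiment may perform the implementations described for the encoding and decoding method provided in the embodiments of the present invention, and may also perform the implementations described for the encoding and decoding apparatus. Details are not repeated here.
It should be understood that the encoding and decoding device according to this embodiment may correspond to the apparatus shown in FIG. 11 and to the entity that performs the encoding and decoding method of FIG. 6, and that the foregoing and other operations and/or functions of its modules serve to implement the corresponding procedures of the method in FIG. 6. For brevity, details are not repeated here.
By selecting, before the source data is compression-encoded, a target algorithm that needs less storage space to encode the source data, and then encoding with that algorithm, the encoding and decoding device in this embodiment can noticeably improve the compression ratio and reduce the storage space occupied without significantly increasing encoding time.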
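Another embodiment of the present invention provides a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, implement the following: obtaining source data to be compressed, and determining a source string and description information from the source data to be compressed, where the source string is the strings in the source data that are not compressed, and the description information describes the correspondence between the compressed strings and the source string; separately calculating the storage space that each of at least two compression algorithms needs to encode the description information; selecting, from the at least two compression algorithms, one that needs less storage space to encode the description information as the target algorithm; and compression-encoding the source data using the target algorithm to obtain compressed data.
The foregoing embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented wholly or partly as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another over a wired connection (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless connection (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that contains one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive (SSD).
The foregoing are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent modification or replacement readily conceived by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.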

Claims (13)

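  1. An encoding and decoding method, comprising:
    obtaining source data to be compressed, and determining a source string and description information from the source data to be compressed, wherein the source string is the strings in the source data that are not compressed, and the description information describes the correspondence between compressed strings and the source string;
    separately calculating the storage space that each of at least two compression algorithms needs to encode the description information;
    selecting, from the at least two compression algorithms, one that needs less storage space to encode the description information as a target algorithm; and
    compression-encoding the source data using the target algorithm to obtain compressed data.
  2. The encoding and decoding method according to claim 1, wherein the compressed data contains an indication field, and the indication field indicates the target algorithm.
  3. The encoding and decoding method according to claim 1 or 2, wherein the description information contains a target field; the target field describes the correspondence between a target string and the source string, and the target string is one of the compressed strings; the target field contains a first value, a second value, and a third value; the first value indicates the positional relationship between the target string and the source string; the second value indicates the start position of the target string in the source string; and the third value indicates the length of the target string.
  4. The encoding and decoding method according to any one of claims 1 to 3, wherein after the determining a source string and description information from the source data to be compressed, the method further comprises:
    storing the source string and the description information separately;
    and wherein the compression-encoding the source data using the target algorithm to obtain compressed data comprises:
    obtaining the target field from the description information;
    obtaining a first string from the source string, wherein the first string is the string in the source data that is adjacent to and precedes the target string; and
    generating, using the target algorithm, the compressed data of a first data segment from the first value, the second value, the third value, and the first string, wherein the first data segment contains the first string and the target string.
  5. The encoding and decoding method according to any one of claims 1 to 3, wherein after the determining a source string and description information from the source data to be compressed, the method further comprises:
    storing target information, wherein the target information contains the target field and a second string, and the second string is the string in the source data that is adjacent to and precedes the target string;
    and wherein the compression-encoding the source data using the target algorithm to obtain compressed data comprises:
    obtaining the target information; and
    generating, using the target algorithm, the compressed data of a second data segment from the first value, the second value, the third value, and the second string, wherein the second data segment contains the second string and the target string.
  6. The encoding and decoding method according to any one of claims 1 to 5, wherein the determining a source string and description information from the source data to be compressed comprises:
    obtaining, in their order of appearance in the source data, the strings that occur in the source data for the first time, to obtain the source string;
    searching the source data, using a hash algorithm, for a string that matches the target string;
    when a string that matches the target string is found, determining that the target string can be compression-encoded; and
    obtaining the first value, the second value, and the third value to generate the target field.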
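  7. An encoding and decoding apparatus, comprising:
    an obtaining unit, configured to obtain source data to be compressed;
    a determining unit, configured to determine a source string and description information from the source data to be compressed, wherein the source string is the strings in the source data that are not compressed, and the description information describes the correspondence between compressed strings and the source string;
    a calculating unit, configured to separately calculate the storage space that each of at least two compression algorithms needs to encode the description information;
    a selecting unit, configured to select, from the at least two compression algorithms, one that needs less storage space to encode the description information as a target algorithm; and
    an encoding unit, configured to compression-encode the source data using the target algorithm to obtain compressed data.
  8. The encoding and decoding apparatus according to claim 7, wherein the compressed data contains an indication field, and the indication field indicates the target algorithm.
  9. The encoding and decoding apparatus according to claim 7 or 8, wherein the description information contains a target field; the target field describes the correspondence between a target string and the source string, and the target string is one of the compressed strings; the target field contains a first value, a second value, and a third value; the first value indicates the positional relationship between the target string and the source string; the second value indicates the start position of the target string in the source string; and the third value indicates the length of the target string.
  10. The encoding and decoding apparatus according to any one of claims 7 to 9, further comprising:
    a first storage unit, configured to store the source string and the description information separately;
    wherein the encoding unit is specifically configured to: obtain the target field from the description information; obtain a first string from the source string, wherein the first string is the string in the source data that is adjacent to and precedes the target string; and generate, using the target algorithm, the compressed data of a first data segment from the first value, the second value, the third value, and the first string, wherein the first data segment contains the first string and the target string.
  11. The encoding and decoding apparatus according to any one of claims 7 to 9, further comprising:
    a second storage unit, configured to store target information, wherein the target information contains the target field and a second string, and the second string is the string in the source data that is adjacent to and precedes the target string;
    wherein the encoding unit is specifically configured to: obtain the target information; and generate, using the target algorithm, the compressed data of a second data segment from the first value, the second value, the third value, and the second string, wherein the second data segment contains the second string and the target string.
  12. The encoding and decoding apparatus according to any one of claims 7 to 11, wherein the obtaining unit is specifically configured to: obtain, in their order of appearance in the source data, the strings that occur in the source data for the first time, to obtain the source string; search the source data, using a hash algorithm, for a string that matches the target string; when a string that matches the target string is found, determine that the target string can be compression-encoded; and generate the target field.
  13. An encoding and decoding device, comprising a processor and a memory that are connected to each other, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 6.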
PCT/CN2018/100615 2018-02-08 2018-08-15 Encoding and decoding method and apparatus, and encoding and decoding device WO2019153700A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810133325.5 2018-02-08
CN201810133325.5A CN108322220A (zh) 2018-02-08 2018-02-08 Encoding and decoding method and apparatus, and encoding and decoding device

Publications (1)

Publication Number Publication Date
WO2019153700A1 (zh) 2019-08-15

Family

ID=62903950

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100615 WO2019153700A1 (zh) 2018-02-08 2018-08-15 Encoding and decoding method and apparatus, and encoding and decoding device

Country Status (2)

Country Link
CN (1) CN108322220A (zh)
WO (1) WO2019153700A1 (zh)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
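CN108322220A (zh) * 2018-02-08 2018-07-24 华为技术有限公司 Encoding and decoding method and apparatus, and encoding and decoding device
CN109067405B (zh) * 2018-07-27 2022-10-11 深圳市元征科技股份有限公司 Data compression method, apparatus, terminal, and computer-readable storage medium
CN110378457B (zh) * 2019-06-26 2023-06-20 全球码链科技有限公司 Code mark generation method and apparatus
CN111767280A (zh) * 2020-04-17 2020-10-13 北京沃东天骏信息技术有限公司 Data processing method, apparatus, and storage medium
CN113765854B (zh) * 2020-06-04 2023-06-30 华为技术有限公司 Data compression method and server
CN111817722A (zh) * 2020-07-09 2020-10-23 北京奥星贝斯科技有限公司 Data compression method and apparatus, and computer device
CN112711935B (zh) * 2020-12-11 2023-04-18 中国科学院深圳先进技术研究院 Encoding method, decoding method, apparatus, and computer-readable storage medium
CN112713899B (zh) * 2020-12-18 2021-10-08 广东高云半导体科技股份有限公司 Method, apparatus, and storage medium for compressing FPGA bitstream data
CN113271108A (zh) * 2021-05-25 2021-08-17 上海众言网络科技有限公司 Questionnaire answer data transmission method and apparatus
CN113676375B (zh) * 2021-08-13 2023-03-14 浙江大学 Method for parsing the structure of private protocols in industrial control systems
CN115002465A (zh) * 2022-05-30 2022-09-02 深圳市吉迩科技有限公司 Lossless compression algorithm, apparatus, computer device, and storage medium for embedded-system images
US11995058B2 (en) * 2022-07-05 2024-05-28 Sap Se Compression service using FPGA compression
CN115499016A (zh) * 2022-11-15 2022-12-20 中科声龙科技发展(北京)有限公司 Binary-based data processing method, apparatus, device, and storage medium
CN115589436B (zh) * 2022-12-14 2023-03-28 三亚海兰寰宇海洋信息科技有限公司 Data processing method, apparatus, and device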

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
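CN1356787A (zh) * 2000-11-24 2002-07-03 松下电器产业株式会社 Sound signal encoding device and method
CN101237301A (zh) * 2008-02-22 2008-08-06 深圳市深信服电子科技有限公司 Dynamic data compression technique
CN101287058A (zh) * 2007-02-21 2008-10-15 三星电子株式会社 Data file compression device and method
CN101448066A (zh) * 2007-11-28 2009-06-03 三星Techwin株式会社 Method and device for controlling file compression ratio
CN102594360A (zh) * 2012-02-01 2012-07-18 浪潮(北京)电子信息产业有限公司 Computer data compression method and apparatus
CN103929185A (zh) * 2013-01-10 2014-07-16 国际商业机器公司 Method and system for reducing the central processing unit overhead of data compression in real time
CN104462334A (zh) * 2014-12-03 2015-03-25 天津南大通用数据技术股份有限公司 Data compression method and apparatus for a column-store database
CN107592117A (zh) * 2017-08-15 2018-01-16 深圳前海信息技术有限公司 Deflate-based compressed data block output method and apparatus
CN108322220A (zh) * 2018-02-08 2018-07-24 华为技术有限公司 Encoding and decoding method and apparatus, and encoding and decoding device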

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002100117A2 (en) * 2001-06-04 2002-12-12 Nct Group, Inc. A system and method for reducing the time to deliver information from a communications network to a user
US20070140353A1 (en) * 2005-12-19 2007-06-21 Sharp Laboratories Of America, Inc. Intra prediction skipping in mode selection for video compression
US8595199B2 (en) * 2012-01-06 2013-11-26 International Business Machines Corporation Real-time selection of compression operations
US9355613B2 (en) * 2012-10-09 2016-05-31 Mediatek Inc. Data processing apparatus for transmitting/receiving compression-related indication information via display interface and related data processing method
US10162700B2 (en) * 2014-12-23 2018-12-25 International Business Machines Corporation Workload-adaptive data packing algorithm
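CN105653698A (zh) * 2015-12-30 2016-06-08 北京奇艺世纪科技有限公司 Data loading method and apparatus for a Hive Table database table
CN107066401B (zh) * 2016-12-30 2020-04-10 Oppo广东移动通信有限公司 Data transmission method based on a mobile terminal architecture, and mobile terminal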

Also Published As

Publication number Publication date
CN108322220A (zh) 2018-07-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18904507

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18904507

Country of ref document: EP

Kind code of ref document: A1