WO2023231313A1 - Data compression method, apparatus, device and storage medium - Google Patents
Data compression method, apparatus, device and storage medium
- Publication number
- WO2023231313A1 (PCT/CN2022/132677)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- character
- encoded
- data code
- frequency
- data
- Prior art date
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/02—Conversion to or from weighted codes, i.e. the weight given to a digit depending on the position of the digit within the block or code word
- H03M7/04—Conversion to or from weighted codes, i.e. the weight given to a digit depending on the position of the digit within the block or code word the radix thereof being two
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Definitions
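- The present application relates to the field of data processing technology, and in particular to a data compression method, apparatus, device and storage medium.
- This application provides a data compression method, apparatus, device and storage medium to compress data, thereby improving bandwidth usage efficiency and meeting the ever-increasing demand for data transmission.
- In a first aspect, this application provides a data compression method, including:
- determining non-idle strings among the strings to be processed, where the non-idle strings include data codes, and the distribution of the data codes conforms to a normal distribution;
- obtaining a first data code based on the data code and the average value of the data code, where the first data code includes at least one character to be encoded;
- performing binary encoding on at least one character to be encoded in the first data code to obtain a second data code;
- obtaining the compression result of the string to be processed based on the characters in the string to be processed other than the data code, the first data code and the second data code.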
- This application provides a data compression apparatus, including:
- a first compression module configured to determine non-idle strings among the strings to be processed, where the non-idle strings include data codes, and the distribution of the data codes conforms to a normal distribution;
- a processing module configured to obtain a first data code based on the data code and the average value of the data code, where the first data code includes at least one character to be encoded;
- a second compression module configured to perform binary encoding on at least one character to be encoded in the first data code to obtain a second data code;
- an obtaining module configured to obtain the compression result of the string to be processed based on the characters in the string to be processed other than the data code, the first data code and the second data code.
- The present application provides an electronic device, including: a processor, a memory, and computer program instructions stored on the memory and executable on the processor.
- When the processor executes the computer program instructions, the data compression method described in the first aspect above is implemented.
- The present application provides a computer-readable storage medium.
- Computer instructions are stored in the computer-readable storage medium. When the computer instructions are executed by a processor, they implement the data compression method described in the first aspect.
- The present application provides a computer program product, including a computer program that implements the data compression method described in the first aspect when executed by a processor.
- The data compression method, apparatus, device and storage medium provided by this application achieve first-level compression of data by determining the non-idle strings among the strings to be processed during the data compression process.
- The first data code, which includes at least one character to be encoded, is then obtained from the data codes and their average value.
- Binary encoding is performed on at least one character to be encoded in the first data code to achieve second-level compression of the data. The data obtained after the two levels of compression can therefore be transmitted during data transmission, which improves bandwidth usage efficiency and meets the increasing data transmission needs.
- In addition, the resources occupied by the data are reduced, which lowers power consumption and energy costs.
- Figure 1A is a schematic diagram of an application scenario of data compression provided by an embodiment of the present application.
- Figure 1B is a schematic diagram of another application scenario of data compression provided by an embodiment of the present application.
- FIG. 2 is a schematic flow chart of Embodiment 1 of the data compression method provided by this application;
- FIG. 3 is a schematic flow chart of Embodiment 2 of the data compression method provided by this application.
- FIG. 4 is a schematic flow chart of Embodiment 3 of the data compression method provided by this application.
- Figure 5 is a schematic structural diagram of an embodiment of a data compression device provided by this application.
- Figure 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- embodiments of the present application provide a data compression method.
- This method achieves first-level compression of data by determining the non-idle strings in the string to be processed. A first data code is then obtained from the data codes in the non-idle strings and their average value.
- The first data code includes at least one character to be encoded. Furthermore, binary encoding is performed on at least one character to be encoded in the first data code to achieve second-level compression of the data.
- During data transmission, the data obtained after the two levels of compression can be transmitted, which improves bandwidth usage efficiency and meets the increasing data transmission needs.
- FIG. 1A is a schematic diagram of an application scenario of data compression provided by an embodiment of the present application.
- the application scenario may include a first device 11 and a second device 12 , and the first device 11 and the second device 12 may communicate in a wired or wireless manner.
- both the first device 11 and the second device 12 have compression and decompression capabilities.
- The first device 11 and/or the second device 12 can determine the non-idle strings among the strings to be processed and, based on the data codes in the non-idle strings and their average value, obtain the first data code, which includes at least one character to be encoded. The device then performs binary encoding on at least one character to be encoded in the first data code to obtain the second data code, and obtains the compression result based on the characters in the string to be processed other than the data code, the first data code and the second data code.
- The compression result can later be retrieved and decompressed to obtain the string to be processed as it was before compression.
- Alternatively, the first device 11 can perform the above processing on the string to be processed, obtain the compression result, and send it to the second device 12; the second device 12 then decompresses the compression result to obtain the string to be processed as it was before compression.
- The terms "first device 11" and "second device 12" do not represent device serial numbers, but are only used to distinguish different devices.
- The scenario shown in FIG. 1A may also include other devices, such as storage devices, which are not limited by the embodiments of the present application.
- FIG. 1B is a schematic diagram of another application scenario of data compression provided by the embodiment of the present application. This application scenario is explained in terms of data processing within the device. As shown in FIG. 1B , the device 10 in this application scenario includes a chip 101 and a memory 102 .
- a network on chip (NoC) 1011 and an artificial intelligence processor 1012 are deployed on the chip 101.
- The NoC 1011 can provide computing and communication functions. The data running on the artificial intelligence processor 1012 can therefore be processed by the NoC 1011 and written into the memory 102, and data can also be read from the memory 102, processed, and finally transmitted to the artificial intelligence processor 1012.
- Before the data running on the artificial intelligence processor 1012 is written into the memory 102 through the NoC 1011, the NoC 1011 can use the data compression method provided by this application to compress the data and then write the compression result into the memory 102. Correspondingly, after the NoC 1011 reads data from the memory 102 and before transmitting it to the artificial intelligence processor 1012, it can decompress the read data and then send the decompressed data to the artificial intelligence processor 1012 for calculation.
- In this way, the data running on the processor is compressed before being stored in the memory 102, and the data read from the memory 102 is decompressed before being transmitted to the artificial intelligence processor 1012, which effectively saves NoC bandwidth and improves resource utilization. Furthermore, processing with the data compression method provided in the embodiments of this application can further improve processing efficiency, save chip area and power consumption, shorten data transmission delay, and greatly improve chip performance.
- The scenario shown in FIG. 1B may also include other components, such as a transceiver, which is not limited by the embodiments of the present application.
- The memory 102 may include non-volatile and/or volatile memory, which will not be described here.
- the device that executes the embodiments of the present application may be a terminal device, a server, a virtual machine, etc., or a distributed computer system composed of one or more servers and/or computers, etc.
- The terminal equipment includes but is not limited to: smartphones, laptop computers, desktop computers, tablet computers, vehicle-mounted equipment, smart wearable devices, etc.
- The server can be an ordinary server or a cloud server. A cloud server, also called a cloud computing server or a cloud host, is a host product in the cloud computing service system.
- the server can also be a server of a distributed system, or a server combined with a blockchain, etc., which are not limited in the embodiments of this application.
- the product implementation form of this application can be included in a software program and be deployed as program code on a device (it can also be hardware with computing capabilities such as a computing cloud or a mobile terminal).
- The program code of the present application may be stored inside the device executing the embodiments of the present application. At runtime, the program code runs on the device's central processing unit (CPU) and/or artificial intelligence processor chip.
- "Multiple" refers to two or more.
- "And/or" describes the relationship between related objects and indicates that three relationships are possible. For example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the related objects are in an "or" relationship.
- FIG. 2 is a schematic flow chart of Embodiment 1 of the data compression method provided by this application. As shown in Figure 2, the data compression method may include the following steps:
- Non-idle strings are first determined among the strings to be processed, where the non-idle strings include data codes and the distribution of the data codes conforms to a normal distribution.
- Data that conforms to the normal distribution can be compressed based on the technical solutions of the embodiments of this application.
- The normal distribution is also known as the Gaussian distribution.
- It is a very important probability distribution that is widely used in fields such as mathematics, physics and engineering and has significant influence on many aspects of statistics; much of the data encountered in practical applications approximately follows a normal distribution.
- When the string to be processed needs to be compressed, it can be received from other devices or read from the device's own database.
- The embodiments of this application do not limit the method of obtaining the string to be processed.
- The data to be compressed is called a string to be processed. The string to be processed may include a non-idle string, and the non-idle string includes data codes that conform to a normal distribution.
- The recorded positions of the non-idle numbers in the string to be processed can first be obtained, and the non-idle strings among the strings to be processed are then determined based on those positions. For example, suppose the string to be processed is the TF32 data "00 3E 00 28 00 00 00 00 07 EF 00 00 00 1E 0F 00", and an index records the positions of the idle numbers and the non-idle numbers.
- Based on the recorded positions of the non-idle numbers, the embodiment of this application determines that the non-idle string in the above TF32 data is "3E 28 07 EF 1E 0F".
- Alternatively, the non-idle numbers and their column numbers in the string to be processed can be determined, and the non-idle string in the string to be processed is obtained from them. For example, taking the string to be processed as the above TF32 data, the column numbers of the idle numbers and non-idle numbers are determined: if the idle number is "00", the "00" and non-"00" numbers in the above TF32 data are determined together with their column numbers.
- The embodiments of this application achieve first-level compression of data by determining the non-idle strings in the strings to be processed, which reduces the resources occupied by the data. The compressed data can then be transmitted during data transmission, improving bandwidth usage efficiency.
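The first-level step described above can be illustrated with a minimal sketch. It assumes that each two-digit hexadecimal value is one "number", that the idle number is "00", and that the index is a simple list of 0/1 flags; the function names `split_idle` and `restore` are hypothetical and do not come from the application.

```python
def split_idle(data: bytes, idle: int = 0x00):
    """Return (index, non_idle); index[i] is 1 where data[i] is a non-idle number."""
    index = [0 if b == idle else 1 for b in data]
    non_idle = bytes(b for b in data if b != idle)
    return index, non_idle


def restore(index, non_idle: bytes, idle: int = 0x00) -> bytes:
    """Rebuild the original string from the index and the non-idle string."""
    values = iter(non_idle)
    return bytes(next(values) if flag else idle for flag in index)


# The TF32 example string from the text.
raw = bytes.fromhex("00 3E 00 28 00 00 00 00 07 EF 00 00 00 1E 0F 00")
index, non_idle = split_idle(raw)
print(non_idle.hex(" ").upper())  # 3E 28 07 EF 1E 0F
```

Only the 6 non-idle numbers plus the index need to be kept, and `restore(index, non_idle)` returns the original 16 numbers.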
- A first data code is then obtained based on the data codes and their average value; the first data code includes at least one character to be encoded.
- Specifically, the difference between each data code and the average value of all data codes can be calculated, and each difference is used as a data code in the first data code, thereby obtaining the first data code.
- Because the data codes in the non-idle string conform to a normal distribution, the differences between the data codes and the average value, that is, the first data code, conform to a normal distribution centered on 0.
- For example, the non-idle string in the TF32 data is "3E 28 07 EF 1E 0F". This string includes data codes that conform to the normal distribution, namely the order codes. The embodiment of the present application can subtract the average value of the order codes from each order code to obtain the first data code (for TF32 data, the first data code can also be called the first order code). The first order code conforms to a normal distribution centered on 0.
- Data normally distributed around 0 can be compressed better with the binary encoding method, which further reduces the resources occupied by the data and improves bandwidth usage efficiency during data transmission.
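As a rough illustration of the mean-subtraction step, the sketch below turns a list of data codes into differences centered on 0. The sample values are made up for illustration, rounding the average to an integer is an assumption (the text does not say how the average is represented), and the function names are hypothetical.

```python
def to_first_data_code(codes):
    """Subtract the (integer-rounded, by assumption) average from every data code."""
    mean = round(sum(codes) / len(codes))
    return mean, [c - mean for c in codes]


def from_first_data_code(mean, diffs):
    """Inverse step: add the average back to recover the data codes."""
    return [d + mean for d in diffs]


# Hypothetical data codes clustered around 127.
codes = [126, 127, 125, 128, 127, 126, 129, 128]
mean, first_data_code = to_first_data_code(codes)
print(mean)             # 127
print(first_data_code)  # [-1, 0, -2, 1, 0, -1, 2, 1]
```

The average has to be kept alongside the differences so that the original data codes can be recovered during decompression.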
- S203 Perform binary encoding on at least one character to be encoded in the first data code to obtain a second data code.
- At least one character to be encoded in the first data code can be binary encoded based on a preset encoding method to obtain the second data code.
- The second data code may be a binary code determined from the binary number of the character to be encoded, from the binary number of the frequency number of the character to be encoded, or from the relationship between the frequency number of the character to be encoded and a preset threshold.
- In addition, the embodiment of the present application can determine the characters to be encoded that occupy preset bits in the first data code, where the preset bits are higher than the bits of the other characters to be encoded in the first data code, and perform binary encoding on the characters to be encoded at the preset bits to obtain the second data code.
- The preset bits can be determined according to the actual situation.
- For example, the non-idle string in the TF32 data is "3E 28 07 EF 1E 0F", in which the data codes that conform to the normal distribution are the order codes. The average value of the order codes is subtracted from each order code to obtain the first data code (for TF32 data, the first data code can also be called the first order code).
- The characters to be encoded at the preset bits in the first order code can then be determined, such as the characters to be encoded in the upper 4 bits, and these characters are binary encoded to obtain the second data code (for TF32 data, the second data code can also be called the second order code).
- The upper 4 bits of the characters to be encoded in the first order code conform more closely to a normal distribution, so the binary encoding method compresses them better, further improving bandwidth usage efficiency during data transmission.
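The following sketch shows one way the upper 4 bits could be separated from the rest of a value so that only they are passed on to binary encoding. It assumes each value of the first data code is stored as an 8-bit two's-complement number, which the text does not specify; the function names are hypothetical.

```python
def split_high_low(value: int):
    """Split an 8-bit two's-complement value into (upper 4 bits, lower 4 bits)."""
    byte = value & 0xFF
    return byte >> 4, byte & 0x0F


def join_high_low(high: int, low: int) -> int:
    """Inverse of split_high_low; returns a signed value in -128..127."""
    byte = (high << 4) | low
    return byte - 256 if byte >= 128 else byte


# Small differences around 0 map to only two distinct upper halves (0x0 and 0xF).
for diff in (-2, -1, 0, 1, 2):
    print(diff, split_high_low(diff))
```

Under this assumption, small positive differences share the upper half 0x0 and small negative ones share 0xF, so the upper halves are dominated by a few values.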
- The characters in the string to be processed other than the data code, the first data code and the second data code can be integrated to obtain the compression result of the string to be processed.
- Integration can be random splicing, splicing according to set rules, random combination, combination according to set rules, etc. This embodiment does not limit the specific implementation of integration.
- The data compression method provided by the embodiments of the present application achieves first-level compression of data by determining the non-idle strings in the string to be processed, and then obtains the first data code, which includes at least one character to be encoded, based on the data codes in the non-idle strings and their average value.
- At least one character to be encoded in the first data code is then binary encoded to achieve second-level compression of the data.
- During data transmission, the data obtained after the two levels of compression can be transmitted, which improves bandwidth usage efficiency and meets the increasing data transmission needs.
- In addition, the resources occupied by the data are reduced, thereby reducing processor power consumption and energy costs.
- FIG 3 is a schematic flow chart of Embodiment 2 of the data compression method provided by this application. As shown in Figure 3, in this embodiment, the above step S203 may include the following steps:
- The frequency of occurrence of each character to be encoded in the string to be encoded can be counted, and the characters to be encoded are then binary encoded.
- To further reduce the resources occupied by the binary encoding, each character to be encoded can be encoded based on its frequency of occurrence, so that the frequency of occurrence of a character is inversely related to the length of its binary code: a character to be encoded that appears more frequently has a shorter binary code than one that appears less frequently.
- Huffman coding is often used for this kind of encoding.
- Huffman coding is an entropy coding method used for lossless compression of data.
- Table 1 is an example of existing Huffman coding.
- As shown in Table 1, for a set of characters "A", "B", "C", "D", "E", "A" occurs 8 times, "B" occurs 10 times, "C" occurs 3 times, "D" occurs 4 times, and "E" occurs 5 times.
- With Huffman coding, "B" is encoded as 11, "A" as 10, "C" as 010, "D" as 011, and "E" as 00.
- Huffman coding is a variable-length code, so the code length differs from character to character. Huffman decoding can therefore only be performed serially, that is, the codes must be decoded one after another from front to back, which makes decoding inefficient and slow.
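The code lengths in the Huffman example above can be checked with a short sketch of the standard Huffman construction (repeatedly merging the two least frequent groups). This is the textbook algorithm rather than anything specific to the application, and the function name is hypothetical. It computes only the code lengths, since the exact bit patterns depend on how the 0/1 branches are labelled.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a dict of symbol frequencies."""
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        # Every symbol in the merged group moves one level deeper in the tree.
        merged = {s: depth + 1 for s, depth in {**group1, **group2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


freqs = {"A": 8, "B": 10, "C": 3, "D": 4, "E": 5}
lengths = huffman_code_lengths(freqs)
print(sorted(lengths.items()))
# [('A', 2), ('B', 2), ('C', 3), ('D', 3), ('E', 2)]
```

These lengths match the codes listed above (2 bits for "A", "B", "E" and 3 bits for "C", "D"), for a total of 67 bits over the 30 characters.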
- In the embodiments of this application, the second data code is obtained by binary encoding the character to be encoded and includes at least a separator.
- Accordingly, the decoding method includes: obtaining the string to be decoded, determining each separator among the binary symbols in the string to be decoded, and decoding the string to be decoded according to the separators to obtain the original characters corresponding to the string to be decoded.
- Because each binary code (second data code) includes a separator, the boundaries of each binary code can be found quickly during decoding, which enables parallel decoding and improves decoding efficiency. This in turn saves chip area and power consumption, shortens the data transmission delay caused by decoding, and greatly improves chip performance.
- FIG. 4 is a schematic flowchart of Embodiment 3 of the data compression method provided by the present application. As shown in Figure 4, in this embodiment, the above step S302 can be implemented through the following steps:
- The characters to be encoded can be assigned frequency numbers based on their frequency of occurrence: the frequency number of each character to be encoded is determined in order of its frequency of occurrence in the first data code, from high to low, and the frequency numbers are positive integers assigned sequentially starting from 1.
- The separator of the character to be encoded is then determined according to its frequency number, and the binary code of the character to be encoded is determined according to the frequency number and the separator.
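Assigning frequency numbers can be sketched as a simple ranking by descending count. The text does not say how ties between equally frequent characters are broken; ordering tied characters by their value is an assumption here, and the function name is hypothetical.

```python
from collections import Counter


def frequency_numbers(chars):
    """Map each character to its frequency number (1 = most frequent)."""
    counts = Counter(chars)
    # Sort by descending count; ties broken by character value (an assumption).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {c: rank for rank, c in enumerate(ordered, start=1)}


# Same counts as the Huffman example: A=8, B=10, C=3, D=4, E=5.
sample = "A" * 8 + "B" * 10 + "C" * 3 + "D" * 4 + "E" * 5
print(frequency_numbers(sample))
# {'B': 1, 'A': 2, 'E': 3, 'D': 4, 'C': 5}
```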
- The separator includes an end character and a prefix character with opposite binary values.
- The number of digits of the prefix character is the same as the number of binary digits of the value of the frequency number of the character to be encoded minus 1.
- The binary encoding of the character to be encoded may also include an intermediate character determined from the binary number of the value of the frequency number minus 1.
- For example, the end character is a one-digit 1 and the prefix character is composed of 0s.
- The end character can also have other digit counts and values. For example, if the end character is a one-digit 0, the prefix character can be composed of a corresponding number of 1s.
- The end character can also be a two-digit 1, in which case the prefix character should be composed in a corresponding way, which will not be described here.
- Table 2 is an example of performing binary encoding on the character to be encoded based on the frequency number and separator to obtain the second data code.
- As shown in Table 2, the end character of the binary code is a single bit "1" and the prefix character is composed of "0"s. The number of digits of each prefix character is the same as the number of binary digits of the frequency number minus 1, and the intermediate character of the binary code is the binary number of the frequency number minus 1.
- When the frequency number is 1 (the frequency number minus 1 is 0, the binary number of 0 is 0, and the number of digits is 1), the prefix character is "0", the end character is "1", and the intermediate character is "0".
- When the frequency number is 2 (the frequency number minus 1 is 1, the binary number of 1 is 1, and the number of digits is 1), the prefix character is "0", the end character is "1", and the intermediate character is "1".
- When the frequency number is 3 (the frequency number minus 1 is 2, the binary number of 2 is 10, and the number of digits is 2), the prefix character is "00", the end character is "1", and the intermediate character is "10".
- When the frequency number is 4 (the frequency number minus 1 is 3, the binary number of 3 is 11, and the number of digits is 2), the prefix character is "00", the end character is "1", and the intermediate character is "11".
- When the frequency number is 5 to 8, the prefix character is "000", the end character is "1", and the intermediate character is "xxx". For example, when the frequency number is 5 (the frequency number minus 1 is 4, the binary number of 4 is 100, 3 digits), the intermediate character "xxx" is "100"; when the frequency number is 8 (the frequency number minus 1 is 7, the binary number of 7 is 111, 3 digits), the intermediate character "xxx" is "111".
- When the frequency number is 9 to 16, the prefix character is "0000", the end character is "1", and the intermediate character is "xxxx". For example, when the frequency number is 9 (the frequency number minus 1 is 8, the binary number of 8 is 1000, 4 digits), the intermediate character "xxxx" is "1000"; when the frequency number is 15 (the frequency number minus 1 is 14, the binary number of 14 is 1110, 4 digits), the intermediate character "xxxx" is "1110".
- The binary encoding of the other characters to be encoded is determined in a similar way and will not be described in detail here.
- Table 2: Binary encoding of the characters to be encoded, based on the frequency number and separator, to obtain the second data code.

| Frequency number | Frequency number minus 1 | Binary encoding (prefix, intermediate, end) |
| --- | --- | --- |
| 1 | 0 | 0 0 1 |
| 2 | 1 | 0 1 1 |
| 3 | 2 | 00 10 1 |
| 4 | 3 | 00 11 1 |
| 5 to 8 | 4 to 7 | 000 xxx 1 |
| 9 to 16 | 8 to 15 | 0000 xxxx 1 |
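The rule behind Table 2 can be sketched as a small function that derives the three fields from a frequency number. The sketch returns the prefix, intermediate and end characters separately, because the order in which they are concatenated in the bitstream is not clear from the flattened table; the function name `table2_fields` is hypothetical.

```python
def table2_fields(freq_no: int):
    """Return (prefix, intermediate, end) for a frequency number, following Table 2."""
    if freq_no < 1:
        raise ValueError("frequency numbers start at 1")
    value = freq_no - 1
    # Number of binary digits of (frequency number - 1); the value 0 counts as 1 digit.
    width = max(1, value.bit_length())
    prefix = "0" * width
    intermediate = format(value, "b").zfill(width)
    end = "1"
    return prefix, intermediate, end


for n in (1, 2, 3, 4, 5, 8, 9, 15):
    print(n, table2_fields(n))
# 1 ('0', '0', '1')
# 2 ('0', '1', '1')
# 3 ('00', '10', '1')
# 4 ('00', '11', '1')
# 5 ('000', '100', '1')
# 8 ('000', '111', '1')
# 9 ('0000', '1000', '1')
# 15 ('0000', '1110', '1')
```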
- In another design, the separator again includes an end character and a prefix character with opposite binary values, but the way the number of digits of the prefix character is determined depends on both the frequency number and a preset threshold.
- The character string to be encoded includes a first character set and a second character set, divided according to the frequency numbers and the preset threshold.
- The frequency number of a first character to be encoded, in the first character set, is less than or equal to the preset threshold, and the frequency number of a second character to be encoded, in the second character set, is greater than the preset threshold.
- For a first character to be encoded, the separator includes a first prefix character and an end character with opposite binary values, and the number of digits of the first prefix character is equal to the value of the frequency number minus 1.
- For a second character to be encoded, the separator includes a second prefix character and an end character with opposite binary values.
- The number of digits of the second prefix character is at least 1 more than the number of digits of the longest first prefix character.
- The binary encoding of a second character to be encoded also includes an intermediate character determined from the binary number of the value of the frequency number minus 1, and the number of digits of the second prefix character is greater than or equal to the number of digits of this intermediate character.
- If the number of binary digits of the frequency number minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of digits of the intermediate character is equal to the preset threshold plus 1.
- If the number of binary digits of the frequency number minus 1 of the second character to be encoded is greater than the preset threshold, the number of digits of the intermediate character is equal to that number of binary digits.
- Table 3 is another example of performing binary encoding on the character to be encoded based on the frequency number and separator to obtain the second data code.
- As shown in Table 3, the preset threshold is equal to 3, the end character of the binary code is a single bit "1", the prefix character is composed of "0"s, and the number of digits of each prefix character is determined by the value of the frequency number minus 1 and the preset threshold 3.
- For a first character to be encoded, the separator of the binary code includes a first prefix character, whose number of digits equals the value of the frequency number minus 1, and the end character "1".
- When the frequency number of the character to be encoded is 1 (the frequency number minus 1 is 0), the separator does not include a prefix character and consists only of the end character "1". When the frequency number is 2 (the frequency number minus 1 is 1), the prefix character is "0" and the end character is "1". When the frequency number is 4 (the frequency number minus 1 is 3), the prefix character is "000" and the end character is "1".
- For a second character to be encoded, the prefix character is "0000" and the end character is "1".
- The binary encoding of the second character to be encoded also includes the intermediate character "xxxx", determined from the binary number of the frequency number minus 1.
- If the number of binary digits of the frequency number minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of digits of the intermediate character is equal to the preset threshold plus 1. For example, when the frequency number of the second character to be encoded in Table 3 is 5 to 8, the number of binary digits of the frequency number minus 1 (4 to 7) is 3, which equals the preset threshold 3, so the number of digits of the intermediate character is the preset threshold 3 plus 1, that is, 4.
- If the number of binary digits of the frequency number minus 1 of the second character to be encoded is greater than the preset threshold, the number of digits of the intermediate character is equal to that number of binary digits. For example, when the frequency number of the second character to be encoded in Table 3 is 9 to 16, the number of binary digits of the frequency number minus 1 (8 to 15) is 4, which is greater than the preset threshold 3, so the number of digits of the intermediate character is 4.
- Table 3: Another example of binary-encoding the characters to be encoded, based on the frequency number and delimiter, to obtain the second data code
Frequency number | Frequency number minus 1 | Binary encoding |
1 | 0 | 1 |
2 | 1 | 0 1 |
3 | 2 | 00 1 |
4 | 3 | 000 1 |
5~16 | 4~15 | 0000 xxxx 1 |
- the data compression method provided by the embodiments of the present application determines the frequency of occurrence of each character to be encoded in the first data code and binary-encodes the characters to be encoded based on that frequency to obtain the second data code, where the binary code of a character that occurs more frequently is shorter than the binary code of a character that occurs less frequently.
- by adding delimiters, the data compression method provided by the embodiments of the present application greatly simplifies subsequent decoding while having little impact on the compression rate of the string to be encoded.
- for example, with the encoding method shown in Table 3, frequency numbers whose original value (the frequency number minus 1) is 4 to 15 receive an additional delimiter consisting of the end character "1" and the prefix "0000", which greatly simplifies subsequent decoding. Because the values follow a normal distribution with a large standard deviation, the additional end character "1" and prefix "0000" have little impact on the overall data compression rate.
- the above embodiments describe the data encoding process.
- for decoding, taking the encoding shown in Table 3 as an example, the embodiments of the present application may obtain a string to be decoded, which includes multiple binary symbols; determine each delimiter among the binary symbols; determine, based on the delimiters, each binary code included in the string to be decoded; determine the frequency number corresponding to each binary code based on that binary code and the preset threshold; and finally determine each original character corresponding to the string to be decoded according to a preset mapping relationship and the frequency numbers. The mapping relationship represents the correspondence between frequency numbers and original characters.
- because the delimiters mark the boundaries of the binary codes, the multiple binary codes included in the string to be decoded can be decoded in parallel, which improves decoding efficiency and reduces resource consumption.
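The decoding steps above can be sketched in Python as follows. This is only a minimal illustration, assuming the Table 3 layout with a preset threshold of 3 and frequency numbers no greater than 16; the function name `decode_table3` and the sample bit string are hypothetical and do not come from the application. The sketch walks the bit string serially for clarity, and mapping the recovered frequency numbers back to original characters would be a separate table lookup.

```python
PRESET_THRESHOLD = 3  # value assumed in Table 3


def decode_table3(bits, threshold=PRESET_THRESHOLD):
    """Return the frequency numbers encoded in `bits` (a string of '0'/'1')."""
    long_prefix = threshold + 1   # "0000" marks the second character set
    middle_len = threshold + 1    # Table 3 uses a 4-bit intermediate symbol
    numbers = []
    i = 0
    while i < len(bits):
        # Count prefix zeros, but never more than the long prefix.
        zeros = 0
        while zeros < long_prefix and i + zeros < len(bits) and bits[i + zeros] == "0":
            zeros += 1
        i += zeros
        if zeros < long_prefix:
            # First character set: `zeros` prefix bits followed by the end character "1".
            if i >= len(bits) or bits[i] != "1":
                raise ValueError("missing end character")
            numbers.append(zeros + 1)
            i += 1
        else:
            # Second character set: intermediate symbol, then the end character "1".
            middle = bits[i:i + middle_len]
            end = bits[i + middle_len:i + middle_len + 1]
            if len(middle) != middle_len or end != "1":
                raise ValueError("truncated code")
            numbers.append(int(middle, 2) + 1)
            i += middle_len + 1
    return numbers


print(decode_table3("00010011000011101011"))
```

Under these assumptions, the 20-bit sample string splits into the codes "0001", "001", "1", "000011101", "01", and "1".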
- FIG. 5 is a schematic structural diagram of an embodiment of a data compression device provided by this application.
- the data compression device may include:
- the first compression module 501 is used to determine non-idle character strings among the character strings to be processed, where the non-idle character strings include data codes, and the distribution of the data codes conforms to a normal distribution.
- the processing module 502 is configured to obtain a first data code according to the data code and the average value of the data code, where the first data code includes at least one character to be encoded.
- the second compression module 503 is used to perform binary encoding on at least one character to be encoded in the first data code to obtain a second data code.
- Obtaining module 504 is used to obtain the compression result of the string to be processed based on other characters in the string to be processed except the data code, the first data code and the second data code.
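- in one possible design, the first compression module 501 is specifically used to obtain the recorded positions of the non-idle numbers in the string to be processed and, based on those positions, determine the non-idle string in the string to be processed.
- in another possible design, the first compression module 501 is specifically used to determine the non-idle numbers in the string to be processed and the column numbers of the non-idle numbers and, based on them, obtain the non-idle string in the string to be processed.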
- the second compression module 503 is specifically used to determine the frequency of occurrence of each character to be encoded in the first data code and, according to that frequency, binary-encode the characters to be encoded to obtain the second data code, where the binary code of a character with a high frequency of occurrence is shorter than the binary code of a character with a low frequency of occurrence.
- the second data code includes at least a delimiter.
- the second compression module 503 is specifically used to determine the frequency number of each character to be encoded in descending order of its frequency of occurrence in the first data code, the frequency numbers being positive integers assigned sequentially starting from 1, and to binary-encode the characters to be encoded according to the frequency number and the delimiter to obtain the second data code.
- the first data code includes a first character set and a second character set divided according to the frequency number and a preset threshold; the frequency number of a first character to be encoded in the first character set is less than or equal to the preset threshold, and the frequency number of a second character to be encoded in the second character set is greater than the preset threshold;
- for the first character set, the delimiter includes a first prefix and an end character with opposite binary values, and the number of digits of the first prefix is equal to the value of the frequency number minus 1;
- for the second character set, the delimiter includes a second prefix and an end character with opposite binary values, and the second prefix has at least one more digit than the longest first prefix.
- the binary encoding of the second character to be encoded further includes an intermediate symbol determined from the binary representation of the corresponding frequency number minus 1.
- if the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded is less than or equal to the preset threshold, the number of digits of the intermediate symbol is equal to the preset threshold plus 1;
- if the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded is greater than the preset threshold, the number of digits of the intermediate symbol is equal to the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded.
- the end character is a one-bit 1.
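- the second compression module 503 is specifically used to determine the characters to be encoded at preset bit positions in the first data code, the preset bit positions being higher than the bit positions of the other characters to be encoded in the first data code, and to binary-encode the characters to be encoded at the preset bit positions to obtain the second data code.
- the processing module 502 is specifically used to compute the difference between each data code and the average value of all data codes and, based on the difference, obtain the first data code.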
- the device provided by the embodiment of the present application can be used to execute the technical solution of the above-mentioned data compression method embodiment. Its implementation principles and technical effects are similar and will not be described again here.
- each module of the above device is only a division of logical functions. In actual implementation, they can be fully or partially integrated into a physical entity, or they can also be physically separated. And these modules can all be implemented in the form of software calling through processing components; they can also all be implemented in the form of hardware; some modules can also be implemented in the form of software calling through processing components, and some modules can be implemented in the form of hardware.
- for example, the processing module can be a separate processing element, or it can be integrated into a chip of the above device; it can also be stored in the memory of the above device in the form of program code, which a processing element of the above device calls to execute the functions of the module.
- each step of the above method or each of the above modules can be completed by instructions in the form of hardware integrated logic circuits or software in the processor element.
- the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field programmable gate arrays (FPGAs).
- the processing element can be a general-purpose processor, such as a central processing unit (CPU) or other processor that can call the program code.
- these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC).
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more available media integrated therein.
- the available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk (SSD)), etc.
- FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- the electronic device may include: a processor 601, a memory 602, a communication interface 603, and a system bus 604.
- the memory 602 and the communication interface 603 are connected to the processor 601 through the system bus 604 and complete communication with each other.
- the memory 602 is used to store computer program instructions, the communication interface 603 is used to communicate with other devices, and the processor 601 implements the technical solutions of the above method embodiments when it executes the computer program instructions.
- the system bus mentioned in Figure 6 can be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
- the system bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
- the communication interface is used to implement communication between electronic devices and other devices (such as clients, read-write libraries, and read-only libraries).
- the memory may include random access memory (RAM) and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
- the above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can be a special-purpose processor, including a graphics processor (GPU), an intelligent processor (IPU), etc.; or it can be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- embodiments of the present application also provide a computer-readable storage medium in which computer instructions are stored; when executed by a processor, the computer instructions are used to implement the technical solutions of the above method embodiments.
- embodiments of the present application also provide a chip that runs instructions, and the chip is used to execute the technical solutions of the above method embodiments.
- Embodiments of the present application also provide a computer program product.
- the computer program product includes a computer program.
- the computer program is stored in a computer-readable storage medium.
- at least one processor can read the computer program from the computer-readable storage medium, and when the at least one processor executes the computer program, the technical solutions of the above method embodiments can be implemented.
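- a data compression method comprising:
- determining non-idle strings among the strings to be processed, where the non-idle strings include data codes, and the distribution of the data codes conforms to a normal distribution;
- obtaining a first data code according to the data codes and the average value of the data codes, where the first data code includes at least one character to be encoded;
- performing binary encoding on at least one character to be encoded in the first data code to obtain a second data code;
- obtaining the compression result of the string to be processed according to the characters in the string to be processed other than the data codes, the first data code, and the second data code.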
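- the determining of non-idle strings in the strings to be processed includes:
- obtaining the recorded positions of the non-idle numbers in the string to be processed and, based on those positions, determining the non-idle string in the string to be processed.
- the determining of non-idle strings in the strings to be processed includes:
- determining the non-idle numbers in the string to be processed and the column numbers of the non-idle numbers and, based on them, obtaining the non-idle string in the string to be processed.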
- Clause A4 The method according to any one of clauses A1-A3, wherein the binary encoding of at least one character to be encoded in the first data code to obtain the second data code includes:
- determining the frequency of occurrence of each character to be encoded in the first data code and, according to that frequency, binary-encoding the characters to be encoded to obtain the second data code, where the binary code of a character with a high frequency of occurrence is shorter than the binary code of a character with a low frequency of occurrence.
- the second data code includes at least a delimiter;
- determining the frequency number of each character to be encoded in descending order of its frequency of occurrence in the first data code, the frequency numbers being positive integers assigned sequentially starting from 1, and binary-encoding the characters to be encoded according to the frequency number and the delimiter to obtain the second data code.
- the first data code includes a first character set and a second character set divided according to the frequency number and a preset threshold; the frequency number of a first character to be encoded in the first character set is less than or equal to the preset threshold, and the frequency number of a second character to be encoded in the second character set is greater than the preset threshold;
- for the first character set, the delimiter includes a first prefix and an end character with opposite binary values, and the number of digits of the first prefix is equal to the value of the frequency number minus 1;
- for the second character set, the delimiter includes a second prefix and an end character with opposite binary values, and the second prefix has at least one more digit than the longest first prefix.
- the binary encoding of the second character to be encoded also includes an intermediate symbol determined from the binary representation of the corresponding frequency number minus 1.
- if the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded is less than or equal to the preset threshold, the number of digits of the intermediate symbol is equal to the preset threshold plus 1;
- if the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded is greater than the preset threshold, the number of digits of the intermediate symbol is equal to the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded.
- Clause A9 The method according to Clause A6, wherein the terminator is a one-bit 1.
- Clause A10 The method according to any one of clauses A1-A3, wherein the binary encoding of at least one character to be encoded in the first data code to obtain the second data code includes:
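- determining the characters to be encoded at preset bit positions in the first data code, and performing binary encoding on the characters to be encoded at the preset bit positions to obtain the second data code.
- Clause A11 The method according to any one of Clauses A1-A3, wherein obtaining the first data code based on the data code and the average value of the data code includes:
- computing the difference between each data code and the average value of all data codes and, based on the difference, obtaining the first data code.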
- a data compression device comprising:
- the first compression module is used to determine non-idle strings among the strings to be processed, where the non-idle strings include data codes, and the distribution of the data codes conforms to a normal distribution;
- a processing module configured to obtain a first data code based on the data code and the average value of the data code, where the first data code includes at least one character to be encoded;
- the second compression module is used to perform binary encoding on at least one character to be encoded in the first data code to obtain a second data code;
- Obtaining module configured to obtain the compression result of the string to be processed based on other characters in the string to be processed except the data code, the first data code and the second data code.
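- the first compression module is specifically used for:
- obtaining the recorded positions of the non-idle numbers in the string to be processed and, based on those positions, determining the non-idle string in the string to be processed.
- the first compression module is specifically used for:
- determining the non-idle numbers in the string to be processed and the column numbers of the non-idle numbers and, based on them, obtaining the non-idle string in the string to be processed.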
- the second compression module is specifically used for:
- determining the frequency of occurrence of each character to be encoded in the first data code and, according to that frequency, binary-encoding the characters to be encoded to obtain the second data code, where the binary code of a character with a high frequency of occurrence is shorter than the binary code of a character with a low frequency of occurrence.
- Clause A26 The device according to Clause A15, the second data code comprising at least a delimiter;
- the second compression module is specifically used for:
- determining the frequency number of each character to be encoded in descending order of its frequency of occurrence in the first data code, the frequency numbers being positive integers assigned sequentially starting from 1, and binary-encoding the characters to be encoded according to the frequency number and the delimiter to obtain the second data code.
- the first data code includes a first character set and a second character set divided according to the frequency number and a preset threshold; the frequency number of a first character to be encoded in the first character set is less than or equal to the preset threshold, and the frequency number of a second character to be encoded in the second character set is greater than the preset threshold;
- for the first character set, the delimiter includes a first prefix and an end character with opposite binary values, and the number of digits of the first prefix is equal to the value of the frequency number minus 1;
- for the second character set, the delimiter includes a second prefix and an end character with opposite binary values, and the second prefix has at least one more digit than the longest first prefix.
- the binary encoding of the second character to be encoded also includes an intermediate symbol determined from the binary representation of the corresponding frequency number minus 1.
- if the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded is less than or equal to the preset threshold, the number of digits of the intermediate symbol is equal to the preset threshold plus 1;
- if the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded is greater than the preset threshold, the number of digits of the intermediate symbol is equal to the number of binary digits of the frequency number minus 1 corresponding to the second character to be encoded.
- Clause A20 The apparatus of clause A17, wherein the terminating character is a one-bit 1.
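- the second compression module is specifically used for:
- determining the characters to be encoded at preset bit positions in the first data code, and performing binary encoding on the characters to be encoded at the preset bit positions to obtain the second data code.
- the processing module is specifically used for:
- computing the difference between each data code and the average value of all data codes and, based on the difference, obtaining the first data code.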
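- An electronic device comprising: a processor, a memory, and computer program instructions stored on the memory and executable on the processor, the processor implementing the above data compression method when executing the computer program instructions.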
- Clause A24 A computer-readable storage medium storing computer instructions which, when executed by a processor, are used to implement the data compression method described in any one of clauses A1 to A11 above.
- Clause A25 A computer program product, including a computer program that implements the data compression method described in any one of the above clauses A1 to A11 when executed by a processor.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
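The present application provides a data compression method, device, equipment, and storage medium, and relates to the technical field of data processing. The electronic device includes a processor, a memory, and computer program instructions stored in the memory and executable on the processor. In this technical solution, data is compressed twice, so the twice-compressed data can be transmitted during data transmission, which improves bandwidth utilization and meets ever-increasing data transmission demands; in addition, the two compression stages reduce the resources occupied by the data, lower power consumption, and reduce energy costs.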
Description
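This application claims priority to Chinese patent application No. 202210617490.4, filed with the China National Intellectual Property Administration on June 1, 2022 and entitled "Data compression method, device, equipment and storage medium", the entire contents of which are incorporated herein by reference.
The present application relates to the technical field of data processing, and in particular to a data compression method, device, equipment, and storage medium.
With the advent of the information age, the volume of data is growing explosively. Taking data transmission as an example, the amount of data to be transmitted is increasing rapidly, and the demand on transmission bandwidth grows day by day. In the related art, data transmission requirements are met by continually increasing the transmission bandwidth.
However, meeting these ever-increasing data transmission demands by adding bandwidth without limit requires enormous investment and does not solve the problem at its root. How to improve bandwidth utilization so as to meet the ever-increasing data transmission demands has therefore become an urgent problem.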
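Summary of the Invention
The present application provides a data compression method, device, equipment, and storage medium that compress data, thereby improving bandwidth utilization and meeting ever-increasing data transmission demands.
In a first aspect, the present application provides a data compression method, including:
determining a non-idle string in a string to be processed, where the non-idle string includes data codes and the distribution of the data codes conforms to a normal distribution;
obtaining a first data code according to the data codes and the average value of the data codes, where the first data code includes at least one character to be encoded;
performing binary encoding on at least one character to be encoded in the first data code to obtain a second data code;
obtaining a compression result of the string to be processed according to the characters in the string to be processed other than the data codes, the first data code, and the second data code.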
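In a second aspect, the present application provides a data compression device, including:
a first compression module, configured to determine a non-idle string in a string to be processed, where the non-idle string includes data codes and the distribution of the data codes conforms to a normal distribution;
a processing module, configured to obtain a first data code according to the data codes and the average value of the data codes, where the first data code includes at least one character to be encoded;
a second compression module, configured to perform binary encoding on at least one character to be encoded in the first data code to obtain a second data code;
an obtaining module, configured to obtain a compression result of the string to be processed according to the characters in the string to be processed other than the data codes, the first data code, and the second data code.
In a third aspect, the present application provides an electronic device, including a processor, a memory, and computer program instructions stored in the memory and executable on the processor, where the processor implements the data compression method of the first aspect when executing the computer program instructions.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer instructions which, when executed by a processor, are used to implement the data compression method of the first aspect.
In a fifth aspect, the present application provides a computer program product including a computer program which, when executed by a processor, implements the data compression method of the first aspect.
With the data compression method, device, equipment, and storage medium provided by the present application, first-level compression is achieved during data compression by determining the non-idle string in the string to be processed; a first data code, which includes at least one character to be encoded, is obtained from the data codes in the non-idle string and their average value; and second-level compression is achieved by binary-encoding at least one character to be encoded in the first data code. The twice-compressed data can then be transmitted, which improves bandwidth utilization and meets ever-increasing data transmission demands; in addition, the two compression stages reduce the resources occupied by the data, lower power consumption, and reduce energy costs.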
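The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.
FIG. 1A is a schematic diagram of one application scenario of data compression provided by an embodiment of the present application;
FIG. 1B is a schematic diagram of another application scenario of data compression provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of Embodiment 1 of the data compression method provided by the present application;
FIG. 3 is a schematic flowchart of Embodiment 2 of the data compression method provided by the present application;
FIG. 4 is a schematic flowchart of Embodiment 3 of the data compression method provided by the present application;
FIG. 5 is a schematic structural diagram of an embodiment of the data compression device provided by the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the concept of the present application in any way, but to explain the concept of the present application to those skilled in the art by reference to specific embodiments.
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.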
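With the rapid development of the Internet and the Internet of Things, data files are becoming ever larger. Taking data transmission as an example, the amount of data to be transmitted is increasing rapidly, and the demand on transmission bandwidth grows day by day. Meeting these ever-increasing data transmission demands by adding bandwidth without limit requires enormous investment and does not solve the problem at its root. How to improve bandwidth utilization so as to meet the ever-increasing data transmission demands has therefore become an urgent problem.
To address the above problem, the embodiments of the present application provide a data compression method. The method achieves first-level compression by determining the non-idle string in the string to be processed; obtains a first data code, which includes at least one character to be encoded, from the data codes in the non-idle string and their average value; and achieves second-level compression by binary-encoding at least one character to be encoded in the first data code. The twice-compressed data can then be transmitted, which improves bandwidth utilization and meets ever-increasing data transmission demands.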
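By way of example, FIG. 1A is a schematic diagram of one application scenario of data compression provided by an embodiment of the present application. As shown in FIG. 1A, the application scenario may include a first device 11 and a second device 12, which may communicate with each other in a wired or wireless manner.
Optionally, in the embodiments of the present application, both the first device 11 and the second device 12 are capable of compression and decompression.
In one possible embodiment, the first device 11 and/or the second device 12 may determine the non-idle string in the string to be processed; obtain a first data code, which includes at least one character to be encoded, from the data codes in the non-idle string and their average value; binary-encode at least one character to be encoded in the first data code to obtain a second data code; and obtain a compression result from the characters in the string to be processed other than the data codes, the first data code, and the second data code. Correspondingly, when the string to be processed is needed, the compression result is retrieved and decompressed to recover the string to be processed as it was before compression.
In one possible embodiment, the first device 11 may perform the above processing on the string to be processed to obtain the compression result and send the compression result to the second device 12; the second device 12 then decompresses the compression result to recover the string to be processed as it was before compression.
It can be understood that the embodiments of the present application do not limit the specific operations of the first device 11 and the second device 12, which may be determined according to the actual scenario and are not described in detail here.
In this embodiment, "first device 11" and "second device 12" do not indicate any ordering of the devices and are only used to distinguish different devices.
It can be understood that the scenario shown in FIG. 1A may also include other devices, for example a storage device, which is not limited in the embodiments of the present application.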
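By way of example, FIG. 1B is a schematic diagram of another application scenario of data compression provided by an embodiment of the present application. This scenario is explained in terms of data processing inside a device. As shown in FIG. 1B, the device 10 in this scenario includes a chip 101 and a memory 102.
Optionally, in this embodiment, a network on chip (NoC) 1011 and an artificial intelligence processor 1012 are deployed on the chip 101. The NoC 1011 can provide computing and communication functions; data running on the artificial intelligence processor 1012 can therefore be processed by the NoC 1011 and then written into the memory 102, and data can be read from the memory 102, processed, and finally transferred to the artificial intelligence processor 1012.
In the embodiments of the present application, before data running on the artificial intelligence processor 1012 is written into the memory 102 through the NoC 1011, the NoC 1011 may compress the data using the data compression method provided by the present application and then write the compression result into the memory 102. Correspondingly, after the NoC 1011 reads data from the memory 102 and before it transfers the data to the artificial intelligence processor 1012, it may decompress the data that was read and then send the decompressed data to the artificial intelligence processor 1012 for computation.
It can be understood that, in this embodiment, compressing data running on the processor before storing it in the memory 102, and decompressing data read from the memory 102 before transferring it to the artificial intelligence processor 1012, can effectively save NoC bandwidth and improve resource utilization. Furthermore, processing with the data compression method provided by the embodiments of the present application can further improve processing efficiency, save chip area and power consumption, shorten data transmission latency, and substantially improve chip performance.
It can be understood that the scenario shown in FIG. 1B may also include other components, for example a transceiver, which is not limited in the embodiments of the present application.
Optionally, in this embodiment, the memory 102, like any reference to memory, storage, a database, or other media used in the embodiments provided by the present application, may include non-volatile and/or volatile memory, which is not described in detail here.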
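It should be noted that the device executing the embodiments of the present application may be a terminal device, a server, or a virtual machine, or a distributed computer system composed of one or more servers and/or computers. The terminal device includes but is not limited to a smartphone, a laptop, a desktop computer, a tablet computer, an in-vehicle device, a smart wearable device, and the like. The server may be an ordinary server or a cloud server; a cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system. The server may also be a server of a distributed system, or a server combined with a blockchain, which is not limited in the embodiments of the present application.
It is worth noting that the product implementation of the present application may be program code contained in a software program and deployed on a device (which may also be hardware with computing capability, such as a computing cloud or a mobile terminal). The program code of the present application may be stored inside the device that executes the embodiments of the present application. At run time, the program code runs on the central processing unit (CPU) and/or the artificial intelligence processor chip of the device.
In the embodiments of the present application, "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The technical solutions of the present application, and how they solve the above technical problem, are described in detail below through specific embodiments with reference to the accompanying drawings. The specific embodiments below may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.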
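FIG. 2 is a schematic flowchart of Embodiment 1 of the data compression method provided by the present application. As shown in FIG. 2, the data compression method may include the following steps:
S201. Determine the non-idle string in the string to be processed, where the non-idle string includes data codes.
Optionally, the distribution of the data codes conforms to a normal distribution. In data processing scenarios, data that conforms to a normal distribution may be compressed based on the technical solutions of the embodiments of the present application in order to reduce data storage space and/or the bandwidth required for network transmission.
In practice, the normal distribution, also known as the Gaussian distribution, is a very important probability distribution with great influence in mathematics, physics, engineering, and many areas of statistics. Much of the data encountered in practical applications conforms to a normal distribution, and the data considered here typically has a relatively large standard deviation.
By way of example, when a string to be processed needs to be compressed, it may be received from another device or read from the device's own database. The embodiments of the present application do not limit how the string to be processed is obtained.
In this embodiment, the data to be compressed is called the string to be processed. The string to be processed may include a non-idle string, and the non-idle string includes data codes that conform to a normal distribution.
In addition, in one possible design of the embodiments of the present application, when determining the non-idle string in the string to be processed, the recorded positions of the non-idle numbers in the string to be processed may first be obtained, and the non-idle string is then determined based on those positions. For example, taking TF32 data as the string to be processed, suppose the string to be processed is "00 3E 00 28 00 00 00 00 07 EF 00 00 00 1E 0F 00". An index records the positions of idle and non-idle numbers; if the idle number is "00", the index records the positions of "00" and non-"00" values. Based on the recorded positions of the non-idle numbers, the embodiment determines that the non-idle string in the TF32 data is "3E 28 07 EF 1E 0F".
Optionally, in the embodiments of the present application, when determining the non-idle string in the string to be processed, the non-idle numbers in the string to be processed and their column numbers may also be determined, and the non-idle string is then obtained based on the non-idle numbers and their column numbers. For example, again taking the above TF32 data as the string to be processed, the column numbers of the idle and non-idle numbers are determined. If the idle number is "00" and the column numbers are positive integers assigned sequentially starting from 1, then in the TF32 data the non-idle number "3E" occupies columns 3 and 4, "28" occupies columns 7 and 8, "07" occupies columns 17 and 18, "EF" occupies columns 19 and 20, "1E" occupies columns 27 and 28, and "0F" occupies columns 29 and 30. Based on these non-idle numbers and their column numbers, the non-idle string in the TF32 data is obtained as "3E 28 07 EF 1E 0F".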
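The first-level compression of S201 can be sketched in Python as follows. This is a minimal illustration rather than the application's implementation: it assumes the string to be processed is handled byte by byte, that the idle number is the byte 0x00, and that the index is simply a list of byte positions; the function names `split_idle`, `hex_columns`, and `restore` are hypothetical.

```python
def split_idle(data, idle=0x00):
    """Return (index, non_idle): positions of non-idle bytes and the bytes themselves."""
    index = [i for i, b in enumerate(data) if b != idle]  # recorded positions
    non_idle = bytes(data[i] for i in index)
    return index, non_idle


def hex_columns(byte_pos):
    """1-based hex-digit column numbers occupied by the byte at `byte_pos`."""
    return (2 * byte_pos + 1, 2 * byte_pos + 2)


def restore(length, index, non_idle, idle=0x00):
    """Rebuild the original bytes from the index and the non-idle bytes."""
    out = bytearray([idle]) * length
    for pos, value in zip(index, non_idle):
        out[pos] = value
    return bytes(out)


# The TF32 example string from the text.
data = bytes.fromhex("00 3E 00 28 00 00 00 00 07 EF 00 00 00 1E 0F 00")
index, non_idle = split_idle(data)
print(non_idle.hex().upper())      # 3E2807EF1E0F
print(hex_columns(index[0]))       # (3, 4): columns of "3E"
```

Keeping the index alongside the non-idle bytes is what lets `restore` put the idle numbers back during decompression.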
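By determining the non-idle string in the string to be processed, the embodiments of the present application achieve first-level compression of the data and reduce the resources the data occupies; the compressed data can then be transmitted, which improves bandwidth utilization.
S202. Obtain a first data code according to the data codes and the average value of the data codes, where the first data code includes at least one character to be encoded.
In the embodiments of the present application, the difference between each data code and the average value of all data codes may be computed, and the first data code is then obtained based on the difference; for example, the difference between each data code and the average value is taken as one data code in the first data code, thereby obtaining the first data code.
Here, the data codes in the non-idle string conform to a normal distribution, so the difference between each data code and the average value, that is, the first data code, conforms to a normal distribution centered on 0. Taking the above TF32 data as the string to be processed, the non-idle string in the TF32 data is "3E 28 07 EF 1E 0F", which includes data codes that conform to a normal distribution, namely exponent codes. The embodiments of the present application may compute the difference between each exponent code and the average exponent code to obtain the first data code; for TF32 data, the first data code may also be called the first exponent code, and it conforms to a normal distribution centered on 0. Data that is normally distributed around 0 is better suited to compression by binary encoding, which further reduces the resources the data occupies and improves bandwidth utilization during data transmission.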
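The mean subtraction of S202 can be sketched in Python as follows. The application does not state how the average is rounded or how the exponent codes are extracted, so this sketch assumes integer data codes and an average rounded to the nearest integer; the function names and the sample exponent values are hypothetical.

```python
def center_codes(codes):
    """Return (mean, residuals) with residuals[i] = codes[i] - mean."""
    n = len(codes)
    mean = (sum(codes) + n // 2) // n   # integer mean, rounded to nearest (assumption)
    return mean, [c - mean for c in codes]


def restore_codes(mean, residuals):
    """Inverse of center_codes: add the mean back to every residual."""
    return [r + mean for r in residuals]


exponents = [126, 127, 127, 128, 129, 125]   # hypothetical exponent codes
mean, first_codes = center_codes(exponents)
print(mean, first_codes)                      # 127 [-1, 0, 0, 1, 2, -2]
```

The residuals cluster around 0, which is the property the subsequent binary encoding relies on; the mean must be kept with the compressed data so that `restore_codes` can undo the step.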
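S203. Perform binary encoding on at least one character to be encoded in the first data code to obtain a second data code.
In the embodiments of the present application, at least one character to be encoded in the first data code may be binary-encoded based on a preset encoding scheme to obtain the second data code.
Optionally, in the embodiments of the present application, the second data code may be a binary code determined from the binary representation of the character to be encoded, a binary code determined from the binary representation of the frequency number of the character to be encoded, or a binary code determined from the relationship between the frequency number of the character to be encoded and a preset threshold.
In addition, when binary-encoding at least one character to be encoded in the first data code, the embodiments of the present application may determine the characters to be encoded at preset bit positions in the first data code, where the preset bit positions are higher than the bit positions of the other characters to be encoded in the first data code, and then binary-encode the characters to be encoded at the preset bit positions to obtain the second data code. The preset bit positions may be determined according to the actual situation. For example, taking the above TF32 data as the string to be processed, the non-idle string in the TF32 data is "3E 28 07 EF 1E 0F", which includes data codes that conform to a normal distribution, namely exponent codes. The difference between each exponent code and the average exponent code is computed to obtain the first data code (for TF32 data, also called the first exponent code). The characters to be encoded at the preset bit positions of the first exponent code, for example the high 4 bits, can then be determined and binary-encoded to obtain the second data code (for TF32 data, also called the second exponent code). For TF32 data, the high 4 bits of the first exponent code conform more closely to a normal distribution and are therefore better suited to compression by binary encoding, which further improves bandwidth utilization during data transmission.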
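Selecting the high 4 bits of a first exponent code can be sketched in Python as follows. The application does not specify the bit width or how negative differences are represented, so this sketch assumes an 8-bit code with negative values wrapped modulo 256; the function names are hypothetical.

```python
def split_high_low(code, width=8, high_bits=4):
    """Split `code` into (high, low) parts; only `high` would be binary-encoded."""
    value = code % (1 << width)          # wrap negative differences into `width` bits
    low_bits = width - high_bits
    return value >> low_bits, value & ((1 << low_bits) - 1)


def join_high_low(high, low, width=8, high_bits=4):
    """Recombine the two parts into the `width`-bit value."""
    return (high << (width - high_bits)) | low


print(split_high_low(0x3E))   # (3, 14)
print(split_high_low(-1))     # (15, 15) under the wrap-around assumption
```

In this reading, the high part is the character to be encoded and the low part is carried through unchanged.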
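S204. Obtain the compression result of the string to be processed according to the characters in the string to be processed other than the data codes, the first data code, and the second data code.
By way of example, after at least one character to be encoded in the first data code has been binary-encoded to obtain the second data code, the characters in the string to be processed other than the data codes, the first data code, and the second data code may be integrated to obtain the compression result of the string to be processed.
It can be understood that the integration may be random concatenation, concatenation according to a set rule, random combination, combination according to a set rule, and so on. This embodiment does not limit how the integration is implemented.
The data compression method provided by the embodiments of the present application achieves first-level compression by determining the non-idle string in the string to be processed; it then obtains a first data code, which includes at least one character to be encoded, from the data codes in the non-idle string and their average value, and achieves second-level compression by binary-encoding at least one character to be encoded in the first data code. The twice-compressed data can then be transmitted, which improves bandwidth utilization and meets ever-increasing data transmission demands; in addition, the two compression stages reduce the resources occupied by the data, thereby lowering processor power consumption and reducing energy costs.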
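On the basis of the embodiment shown in FIG. 2, the data compression method provided by the embodiments of the present application is described in more detail below through the embodiment shown in FIG. 3.
FIG. 3 is a schematic flowchart of Embodiment 2 of the data compression method provided by the present application. As shown in FIG. 3, in this embodiment, step S203 may include the following steps:
S301. Determine the frequency of occurrence of each character to be encoded in the first data code.
In this embodiment, after the string to be encoded is obtained, the frequency of occurrence of each character to be encoded in the string may be counted, and the characters to be encoded are then binary-encoded.
By way of example, for the string to be encoded "320E10", the frequency of occurrence of the character "0" is 2, and the frequency of occurrence of each of the characters "3", "2", "E", and "1" is 1.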
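S302. According to the frequency of occurrence, binary-encode the characters to be encoded to obtain the second data code, where the binary code of a character with a high frequency of occurrence is shorter than the binary code of a character with a low frequency of occurrence.
By way of example, in the embodiments of the present application, to further reduce the resources occupied by the binary code, each character to be encoded may be encoded based on its frequency of occurrence, so that the length of the binary code decreases as the frequency of occurrence increases; that is, the binary code of a character with a high frequency of occurrence is shorter than the binary code of a character with a low frequency of occurrence.
In the related art, Huffman coding is commonly used for encoding. Huffman coding is a consistency coding method (also known as "entropy coding") used for lossless data compression.
By way of example, Table 1 is an example of existing Huffman coding. As shown in Table 1, for a group of characters "A", "B", "C", "D", and "E", "A" occurs 8 times, "B" occurs 10 times, "C" occurs 3 times, "D" occurs 4 times, and "E" occurs 5 times. According to the Huffman coding principle, the code of "B" is 11, the code of "A" is 10, the code of "C" is 010, the code of "D" is 011, and the code of "E" is 00.
Table 1: An example of Huffman coding
Character | Count | Code |
A | 8 | 10 |
B | 10 | 11 |
C | 3 | 010 |
D | 4 | 011 |
E | 5 | 00 |
As can be seen from the above, Huffman coding is a variable-length code in which the code length of each character is not fixed. Huffman decoding can therefore only be performed serially, that is, the data must be decoded sequentially from front to back, which is inefficient and slow.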
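The Table 1 example can be checked with a short Python sketch. The helper `huffman_code_lengths` is an illustrative textbook construction using the standard-library `heapq`, not code from the application; it derives only code lengths, because the exact bit patterns depend on how branches are labeled. `decode_serial` then decodes with the Table 1 codes one bit at a time, which illustrates why the position of each code is only known after the previous one has been decoded.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman code built from `freqs`."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [sym]) for sym, w in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:          # every merge adds one bit to these symbols
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return lengths


TABLE1 = {"A": "10", "B": "11", "C": "010", "D": "011", "E": "00"}


def decode_serial(bits, table=TABLE1):
    """Decode front to back; a code boundary is known only after the previous code."""
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


print(huffman_code_lengths({"A": 8, "B": 10, "C": 3, "D": 4, "E": 5}))
print(decode_serial("111001101000"))   # BADCE
```

The derived lengths (2 bits for A, B, and E; 3 bits for C and D) agree with the codes listed in Table 1.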
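To address the above problem, in the embodiments of the present application the second data code includes at least a delimiter. The characters to be encoded are binary-encoded to obtain the second data code, which includes at least a delimiter. Correspondingly, the decoding method includes: obtaining a string to be decoded; determining each delimiter among the multiple binary symbols in the string to be decoded; and decoding the string to be decoded according to the delimiters to obtain the original characters corresponding to the string to be decoded. In this technical solution, because each binary code (second data code) includes a delimiter, the boundaries of the binary codes can be found quickly during decoding, which enables parallel decoding and improves decoding efficiency, thereby saving chip area and power consumption, shortening the transmission latency of decoded data, and substantially improving chip performance.
In one possible implementation of the present application, FIG. 4 is a schematic flowchart of Embodiment 3 of the data compression method provided by the present application. As shown in FIG. 4, in this embodiment, step S302 may be implemented through the following steps:
S401. Determine the frequency number of each character to be encoded in descending order of its frequency of occurrence in the first data code, where the frequency numbers are positive integers assigned sequentially starting from 1.
S402. According to the frequency number and the delimiter, binary-encode the characters to be encoded to obtain the second data code.
Optionally, the characters to be encoded may be numbered by frequency based on their frequency of occurrence. For example, the frequency number of each character to be encoded is determined in descending order of its frequency of occurrence in the first data code, the frequency numbers being positive integers assigned sequentially starting from 1. The delimiter of the character to be encoded is then determined according to its frequency number, and the binary code of the character to be encoded is determined according to the frequency number and the determined delimiter.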
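Assigning frequency numbers as in S401 can be sketched in Python as follows. The application does not say how characters with equal frequency are ordered, so this sketch assumes ties are broken by first appearance, which is the order `collections.Counter.most_common` returns; the function name `frequency_numbers` is hypothetical.

```python
from collections import Counter


def frequency_numbers(first_data_code):
    """Map each character to its frequency number (1 = most frequent)."""
    counts = Counter(first_data_code)
    # most_common() sorts by descending count; equal counts keep first-seen order.
    ranked = counts.most_common()
    return {char: rank for rank, (char, _) in enumerate(ranked, start=1)}


numbers = frequency_numbers("320E10")
print(numbers)   # {'0': 1, '3': 2, '2': 3, 'E': 4, '1': 5}
```

The resulting mapping is also what a decoder would need, inverted, to turn frequency numbers back into original characters.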
在本申请实施例的一种可能设计中,根据各待编码字符的频次序号可以确定分隔符包括取值相反的二进制的结尾符和前缀符,此时,前缀符的位数与频次序号的取值减1的二进制位数相同。相应的,待编码字符的二进制编码还可以包括根据该频次序号的取值减1的二进制数确定的中间符。
可选的,在本申请的实施例中,结尾符为一位的1。
可理解,结尾符也可以是其他的位数和数值,例如,结尾符为一位的0,此时,前缀符可以采用对应数量的1组成,再比如,结尾符还可以是两位的1,此时前缀符应该有相应的组成方式,此处不作赘述。
例如,表2是基于频次序号和分隔符,对待编码字符进行二进制编码,获得第二数据码的一种示例。如表2所示,假设二进制编码的结尾符采用一位的1表示,则二进制编码的前缀符采用0表示,且各二进制编码的前缀符的位数与频次序号减1的二进制位数相同,二进制编码的中间符是该频次序号减1的二进制数。例如,频次序号为1(频次序号减1为0,0的二进制是0,位数1位)时,前缀符为“0”,结尾符为“1”,中间符为“0”;频次序号为2(频次序号减1为1,1的二进制是1,位数1位)时,前缀符为“0”,结尾符为“1”,中间符为“1”;频次序号为3(频次序号减1为2,2的二进制是10,位数2位)时,前缀符为“00”,结尾符为“1”,中间符为“10”;频次序号为4(频次序号减1为3,3的二进制是11,位数2位)时,前缀符为“00”,结尾符为“1”,中间符为“11”。
频次序号为5~8(频次序号减1为4~7)时,前缀符为“000”,结尾符为“1”,中间符为“xxx”,比如,频次序号为5(频次序号减1为4,4的二进制是100,位数3位)时,前缀符为“000”,结尾符为“1”,中间符“xxx”为“100”,频次序号为8(频次序号减1为7,7的二进制是111,位数3位)时,前缀符为“000”,结尾符为“1”,中间符“xxx”为“111”。
频次序号为9~16(频次序号减1为8~15)时,前缀符为“0000”,结尾符为“1”,中间符为“xxxx”,比如,频次序号为9(频次序号减1为8,8的二进制是1000,位数4位)时,前缀符为“0000”,结尾符为“1”,中间符“xxxx”为“1000”,频次序号为15(频次序号减1为14,14的二进制是1110,位数4位)时,前缀符为“0000”,结尾符为“1”,中间符“xxxx”为“1110”,其他待编码字符的二进制编码的确定方式类似,此处不作赘述。
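Table 2: An example of binary-encoding the characters to be encoded based on the frequency ordinal and the delimiter to obtain the second data code
Frequency ordinal | Frequency ordinal minus 1 | Binary code |
---|---|---|
1 | 0 | 0 0 1 |
2 | 1 | 0 1 1 |
3 | 2 | 00 10 1 |
4 | 3 | 00 11 1 |
5~8 | 4~7 | 000 xxx 1 |
9~16 | 8~15 | 0000 xxxx 1 |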
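The construction in Table 2 can be sketched as a short function. This is an illustrative reading of the table, assuming the end symbol is a single "1" and the prefix symbol consists of "0"s; the function name `encode_table2` is hypothetical.

```python
def encode_table2(ordinal):
    """Encode a frequency ordinal (>= 1) as prefix + intermediate + end symbol."""
    if ordinal < 1:
        raise ValueError("frequency ordinal must be a positive integer")
    middle = format(ordinal - 1, "b")  # binary of (ordinal - 1); "0" when ordinal is 1
    prefix = "0" * len(middle)         # as many 0s as the intermediate symbol has bits
    return prefix + middle + "1"       # single-bit end symbol "1"

# Rows of Table 2, written without the spaces used there for readability.
table2_codes = {n: encode_table2(n) for n in (1, 2, 3, 4, 5, 8, 9, 15)}
```

For frequency ordinal 15 the function yields "000011101", that is, the prefix "0000", the intermediate symbol "1110", and the end symbol "1", matching the worked example in the text.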
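In a possible design of the embodiments of the present application, it may be determined from the frequency ordinal of each character to be encoded and a preset threshold that the delimiter includes a binary end symbol and a binary prefix symbol with opposite values. In this case, the way the number of bits of the prefix symbol is determined depends on the frequency ordinal and the preset threshold.
For example, in this possible design, the character string to be encoded includes a first character set and a second character set divided according to the frequency ordinal and the preset threshold. The frequency ordinal of a first character to be encoded in the first character set is less than or equal to the preset threshold, and the frequency ordinal of a second character to be encoded in the second character set is greater than the preset threshold.
For the first character set, the delimiter includes a binary first prefix symbol and a binary end symbol with opposite values, and the number of bits of the first prefix symbol equals the value of the frequency ordinal minus 1.
For the second character set, the delimiter includes a binary second prefix symbol and a binary end symbol with opposite values, and the second prefix symbol has at least 1 bit more than the longest first prefix symbol.
Optionally, for the second character set, the number of bits of the second prefix symbol is greater than or equal to the number of bits of the intermediate symbol determined from the binary number of the frequency ordinal minus 1.
For example, if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1.
If the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded.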
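For example, Table 3 shows another example of binary-encoding the characters to be encoded based on the frequency ordinal and the delimiter to obtain the second data code. As shown in Table 3, assume the preset threshold equals 3 and the end symbol of the binary code is a single bit "1". The prefix symbol of the binary code then consists of "0"s, and the number of bits of the prefix symbol of each binary code is determined by the value of the frequency ordinal minus 1 and the preset threshold 3. When the value of the frequency ordinal of a character to be encoded minus 1 is less than or equal to the preset threshold 3, the delimiter of the binary code includes the end symbol "1" and a first prefix symbol whose number of bits equals the value of the frequency ordinal minus 1.
For example, referring to Table 3, when the frequency ordinal of a character to be encoded is 1 (the frequency ordinal minus 1 is 0), the delimiter of the binary code includes no prefix symbol but includes the end symbol "1". When the frequency ordinal is 2 (the frequency ordinal minus 1 is 1), the prefix symbol is "0" and the end symbol is "1". When the frequency ordinal is 4 (the frequency ordinal minus 1 is 3), the prefix symbol is "000" and the end symbol is "1".
Optionally, when the frequency ordinal is 5 to 16 (the frequency ordinal minus 1 is 4 to 15), the prefix symbol is "0000", the end symbol is "1", and the binary code of the second character to be encoded further includes an intermediate symbol "xxxx" determined from the binary number of the frequency ordinal minus 1.
Optionally, if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1. For example, when the frequency ordinal of the second character to be encoded in Table 3 is 5 to 8, the number of binary digits of the frequency ordinal minus 1 (4 to 7) equals the preset threshold 3, so the number of bits of the intermediate symbol equals the preset threshold 3 plus 1, that is, 4.
If the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded. For example, when the frequency ordinal of the second character to be encoded in Table 3 is 9 to 16, the number of binary digits of the frequency ordinal minus 1 (8 to 15) is 4, which is greater than the preset threshold 3, so the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 (8 to 15), that is, 4.
Optionally, referring to Table 3, when the frequency ordinal is 5 (the frequency ordinal minus 1 is 4), the intermediate symbol "xxxx" is "0100"; when the frequency ordinal is 14 (the frequency ordinal minus 1 is 13), the intermediate symbol "xxxx" is "1101". The binary codes of the other characters to be encoded are determined in a similar way; details are not repeated here.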
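Table 3: Another example of binary-encoding the characters to be encoded based on the frequency ordinal and the delimiter to obtain the second data code
Frequency ordinal | Frequency ordinal minus 1 | Binary code |
---|---|---|
1 | 0 | 1 |
2 | 1 | 0 1 |
3 | 2 | 00 1 |
4 | 3 | 000 1 |
5~16 | 4~15 | 0000 xxxx 1 |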
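A minimal sketch of the Table 3 construction follows. It assumes, as in the worked example, a preset threshold of 3, an end symbol "1", and a prefix of "0"s, and it follows the rows of Table 3 in treating a character as belonging to the first character set when its frequency ordinal minus 1 does not exceed the threshold. The function name `encode_table3` is hypothetical, and ordinals beyond the range shown in the table are rejected rather than guessed at.

```python
def encode_table3(ordinal, threshold=3):
    """Encode a frequency ordinal per Table 3 (short form or prefix + intermediate)."""
    value = ordinal - 1
    if value < 0:
        raise ValueError("frequency ordinal must be a positive integer")
    if value <= threshold:
        # First character set: (ordinal - 1) zeros, then the end symbol.
        return "0" * value + "1"
    width = threshold + 1
    if value.bit_length() > width:
        # Table 3 stops at ordinal 16; wider intermediates are not sketched here.
        raise ValueError("ordinal is outside the range shown in Table 3")
    # Second character set: fixed prefix, zero-padded intermediate, end symbol.
    return "0" * width + format(value, f"0{width}b") + "1"

table3_codes = [encode_table3(n) for n in range(1, 17)]
```

With these assumptions, the four most frequent characters cost 1 to 4 bits each, and every other character costs a fixed 9 bits.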
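In the data compression method provided by the embodiments of the present application, the occurrence frequency of each character to be encoded in the first data code is determined, and the characters to be encoded are binary-encoded according to the occurrence frequencies to obtain the second data code, where the binary code of a character to be encoded with a higher occurrence frequency is shorter than the binary code of a character to be encoded with a lower occurrence frequency. The coding method of this technical solution not only increases decoding speed but also effectively reduces the resources occupied by the binary codes.
Based on the solutions described in the above embodiments, the data compression method provided by the embodiments of the present application adds delimiters, which greatly facilitates subsequent decoding while having little effect on the compression ratio of the character string to be encoded. For example, when the coding method shown in Table 3 is used, a delimiter consisting of the end symbol "1" and the prefix symbol "0000" is added for the frequency ordinals whose original values are in the range 4 to 15, which greatly facilitates subsequent decoding. Because the values follow a normal distribution with a relatively large standard deviation, the added end symbol "1" and prefix symbol "0000" have little effect on the overall data compression ratio.
For example, take the case where the values of the frequency ordinal minus 1 of the characters to be encoded form the string "320E10". When each value of the frequency ordinal minus 1 is represented with 4 bits, the length of the string "320E10" before encoding is 6*4=24 bits. Table 4 lists the binary coding result of each character in the string "320E10". As shown in Table 4, with the coding method shown in Table 3, the binary code of the character "3" is "0001", the binary code of the character "2" is "001", the binary code of the character "0" is "1", the binary code of the character "E" is "000011101", and the binary code of the character "1" is "01". The length of the binary coding result of the string "320E10" is therefore 4+3+1+9+2+1=20 bits.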
Table 4: Binary coding result of each character in the string "320E10"
Character | 3 | 2 | 0 | E | 1 | 0 |
Binary code | 0001 | 001 | 1 | 000011101 | 01 | 1 |
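The arithmetic of Table 4 can be reproduced with a few lines. This sketch assumes each character of "320E10" is a hexadecimal digit giving the frequency ordinal minus 1, and applies the Table 3 construction with a preset threshold of 3; the helper name `encode_value` is hypothetical.

```python
def encode_value(value, threshold=3):
    """Table 3 style code for value = frequency ordinal - 1 (0..15 in this example)."""
    if value <= threshold:
        return "0" * value + "1"
    width = threshold + 1
    return "0" * width + format(value, f"0{width}b") + "1"

values = [int(ch, 16) for ch in "320E10"]    # hex digits -> 3, 2, 0, 14, 1, 0
codes = [encode_value(v) for v in values]
fixed_bits = 4 * len(values)                 # 4 bits per value before encoding
coded_bits = sum(len(code) for code in codes)
```

Here `codes` matches the bottom row of Table 4, and the encoded length is 20 bits against 24 bits for the fixed 4-bit representation.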
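The above embodiments describe the data encoding process. When decoding data, taking the coding shown in Table 3 as an example, the embodiments of the present application may obtain a character string to be decoded that includes a plurality of binary symbols, determine each delimiter among the plurality of binary symbols, determine from the delimiters each binary code contained in the character string to be decoded, determine the frequency ordinal corresponding to each binary code according to the binary codes and the preset threshold, and finally determine the original characters corresponding to the character string to be decoded according to a preset mapping relationship and the frequency ordinals, where the mapping relationship represents the correspondence between frequency ordinals and original characters. In this technical solution, once the delimiters have been determined, the binary codes contained in the character string to be decoded can be decoded in parallel, which improves decoding efficiency and reduces resource consumption.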
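The decoding flow just described can be sketched in software as two stages: locate the codeword boundaries from the delimiters, then decode each codeword on its own. This is only an illustration under the Table 3 assumptions (preset threshold 3, end symbol "1", prefix of "0"s); the function names are hypothetical, the boundary scan here is sequential, and the application's hardware approach to finding delimiters is not reproduced. The final step of mapping frequency ordinals back to original characters is omitted.

```python
def split_codewords(bits, threshold=3):
    """Return (start, end) index pairs of the codewords in a Table 3 style stream."""
    width = threshold + 1
    spans, start = [], 0
    while start < len(bits):
        zeros = 0
        while zeros < width and bits[start + zeros:start + zeros + 1] == "0":
            zeros += 1
        # Short form: zeros + end symbol. Long form: prefix + intermediate + end symbol.
        end = start + zeros + 1 if zeros < width else start + 2 * width + 1
        if end > len(bits) or bits[end - 1] != "1":
            raise ValueError("malformed or truncated stream")
        spans.append((start, end))
        start = end
    return spans

def decode_codeword(word, threshold=3):
    """Return the frequency ordinal of a single codeword."""
    width = threshold + 1
    if len(word) <= width:
        return len(word)                       # k zeros + "1" encodes ordinal k + 1
    return int(word[width:2 * width], 2) + 1   # intermediate symbol holds ordinal - 1

def decode(bits, threshold=3):
    # Once the spans are known, each codeword is decoded independently of the others.
    return [decode_codeword(bits[s:e], threshold) for s, e in split_codewords(bits, threshold)]

decoded = decode("0001" "001" "1" "000011101" "01" "1")
```

Applied to the concatenated codes of Table 4, `decode` recovers the frequency ordinals 4, 3, 1, 15, 2, 1, that is, the original values 3, 2, 0, E, 1, 0 after subtracting 1.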
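The following are apparatus embodiments of the present application, which may be used to carry out the method embodiments of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present application.
For example, FIG. 5 is a schematic structural diagram of an embodiment of the data compression apparatus provided by the present application. As shown in FIG. 5, the data compression apparatus may include:
a first compression module 501, configured to determine a non-idle character string in a character string to be processed, where the non-idle character string includes a data code and the distribution of the data code conforms to a normal distribution;
a processing module 502, configured to obtain a first data code according to the data code and the average value of the data code, where the first data code includes at least one character to be encoded;
a second compression module 503, configured to binary-encode the at least one character to be encoded in the first data code to obtain a second data code; and
an obtaining module 504, configured to obtain a compression result of the character string to be processed according to the characters in the character string to be processed other than the data code, the first data code, and the second data code.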
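In a possible implementation of this embodiment, the first compression module 501 is specifically configured to:
obtain the recorded positions of the non-idle numbers in the character string to be processed;
determine the non-idle character string in the character string to be processed based on the positions.
In a possible implementation of this embodiment, the first compression module 501 is specifically configured to:
determine the non-idle numbers in the character string to be processed and the column numbers of the non-idle numbers;
obtain the non-idle character string in the character string to be processed based on the non-idle numbers and the column numbers of the non-idle numbers.
In a possible implementation of this embodiment, the second compression module 503 is specifically configured to:
determine the occurrence frequency of each character to be encoded in the first data code;
binary-encode the characters to be encoded according to the occurrence frequencies to obtain the second data code, where the binary code of a character to be encoded with a higher occurrence frequency is shorter than the binary code of a character to be encoded with a lower occurrence frequency.
In a possible implementation of this embodiment, the second data code includes at least a delimiter.
The second compression module 503 is specifically configured to:
determine the frequency ordinal of each character to be encoded in descending order of the frequency with which the characters occur in the first data code, where the frequency ordinals are positive integers assigned sequentially starting from 1;
binary-encode the characters to be encoded according to the frequency ordinals and the delimiter to obtain the second data code.
In a possible implementation of this embodiment, the first data code includes a first character set and a second character set divided according to the frequency ordinal and a preset threshold, where the frequency ordinal of a first character to be encoded in the first character set is less than or equal to the preset threshold, and the frequency ordinal of a second character to be encoded in the second character set is greater than the preset threshold;
for the first character set, the delimiter includes a binary first prefix symbol and a binary end symbol with opposite values, and the number of bits of the first prefix symbol equals the value of the frequency ordinal minus 1;
for the second character set, the delimiter includes a binary second prefix symbol and a binary end symbol with opposite values, and the second prefix symbol has at least 1 bit more than the longest first prefix symbol.
In a possible implementation of this embodiment, the binary code of the second character to be encoded further includes an intermediate symbol determined from the binary number of the corresponding frequency ordinal.
In a possible implementation of this embodiment, if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1;
if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded.
In a possible implementation of this embodiment, the end symbol is a single bit 1.
In a possible implementation of this embodiment, the second compression module 503 is specifically configured to:
determine the characters to be encoded at preset bit positions in the first data code, where the preset bit positions are higher than the bit positions of the other characters to be encoded in the first data code;
binary-encode the characters to be encoded at the preset bit positions to obtain the second data code.
In a possible implementation of this embodiment, the processing module 502 is specifically configured to:
calculate the difference between the data code and the average value of the data code;
obtain the first data code based on the difference.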
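The difference step performed by the processing module can be illustrated with a small sketch. The sample values, the function name `subtract_mean`, and the rounding of the average to an integer are all assumptions for illustration; this section does not specify how the first data code is derived from the differences beyond being "based on" them.

```python
def subtract_mean(data_codes):
    """Return the (integer-rounded) mean and each value's difference from it."""
    # Assumption: the mean is rounded so the differences stay integers.
    mean = round(sum(data_codes) / len(data_codes))
    return mean, [x - mean for x in data_codes]

samples = [118, 120, 121, 119, 122, 120]   # hypothetical data codes clustered near 120
mean, differences = subtract_mean(samples)
restored = [d + mean for d in differences]
```

For data that is approximately normally distributed, the differences cluster around zero, so a few small values dominate and a frequency-ordered code can give them the shortest binary codes.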
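The apparatus provided by the embodiments of the present application may be used to carry out the technical solutions of the above data compression method embodiments. Its implementation principles and technical effects are similar and are not repeated here.
It should be noted that the division of the above apparatus into modules is merely a division of logical functions. In actual implementation, the modules may be fully or partially integrated into one physical entity, or may be physically separate. These modules may all be implemented as software invoked by a processing element, may all be implemented in hardware, or some modules may be implemented as software invoked by a processing element while others are implemented in hardware. For example, the processing module may be a separately provided processing element, or may be integrated into a chip of the above apparatus; it may also be stored in the memory of the above apparatus in the form of program code, with a processing element of the apparatus invoking and executing the functions of the module. The other modules are implemented similarly. In addition, all or some of these modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. During implementation, the steps of the above methods or the above modules may be completed by integrated logic circuits of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, for example one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA). As another example, when one of the above modules is implemented in the form of a processing element invoking program code, the processing element may be a general-purpose processor, for example a central processing unit (CPU), or another processor that can invoke program code, for example an intelligence processing unit (IPU). As another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
The above embodiments may be implemented fully or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented fully or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).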
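Optionally, FIG. 6 is a schematic structural diagram of the electronic device provided by the embodiments of the present application. As shown in FIG. 6, the electronic device may include a processor 601, a memory 602, a communication interface 603, and a system bus 604. The memory 602 and the communication interface 603 are connected to the processor 601 through the system bus 604 and communicate with one another through it. The memory 602 is configured to store computer program instructions, the communication interface 603 is configured to communicate with other devices, and the processor 601 implements the technical solutions of the above method embodiments when executing the computer program instructions.
The system bus mentioned in FIG. 6 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus. The communication interface is used for communication between the electronic device and other devices (for example a client, a read-write library, and a read-only library). The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may be a special-purpose processor, including a graphics processing unit (GPU), an intelligence processing unit (IPU), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Optionally, the embodiments of the present application further provide a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the technical solutions of the above method embodiments.
Optionally, the embodiments of the present application further provide a chip for running instructions, the chip being configured to carry out the technical solutions of the above method embodiments.
The embodiments of the present application further provide a computer program product including a computer program stored in a computer-readable storage medium. At least one processor can read the computer program from the computer-readable storage medium, and the at least one processor implements the technical solutions of the above method embodiments when executing the computer program.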
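The foregoing may be better understood in light of the following clauses:
Clause A1. A data compression method, comprising:
determining a non-idle character string in a character string to be processed, the non-idle character string including a data code, the distribution of the data code conforming to a normal distribution;
obtaining a first data code according to the data code and the average value of the data code, the first data code including at least one character to be encoded;
binary-encoding the at least one character to be encoded in the first data code to obtain a second data code;
obtaining a compression result of the character string to be processed according to the characters in the character string to be processed other than the data code, the first data code, and the second data code.
Clause A2. The method according to clause A1, wherein determining the non-idle character string in the character string to be processed comprises:
obtaining the recorded positions of the non-idle numbers in the character string to be processed;
determining the non-idle character string in the character string to be processed based on the positions.
Clause A3. The method according to clause A1, wherein determining the non-idle character string in the character string to be processed comprises:
determining the non-idle numbers in the character string to be processed and the column numbers of the non-idle numbers;
obtaining the non-idle character string in the character string to be processed based on the non-idle numbers and the column numbers of the non-idle numbers.
Clause A4. The method according to any one of clauses A1 to A3, wherein binary-encoding the at least one character to be encoded in the first data code to obtain the second data code comprises:
determining the occurrence frequency of each character to be encoded in the first data code;
binary-encoding the characters to be encoded according to the occurrence frequencies to obtain the second data code, wherein the binary code of a character to be encoded with a higher occurrence frequency is shorter than the binary code of a character to be encoded with a lower occurrence frequency.
Clause A5. The method according to clause A4, wherein the second data code includes at least a delimiter;
binary-encoding the characters to be encoded according to the occurrence frequencies to obtain the second data code comprises:
determining the frequency ordinal of each character to be encoded in descending order of the frequency with which the characters occur in the first data code, the frequency ordinals being positive integers assigned sequentially starting from 1;
binary-encoding the characters to be encoded according to the frequency ordinals and the delimiter to obtain the second data code.
Clause A6. The method according to clause A5, wherein
the first data code includes a first character set and a second character set divided according to the frequency ordinal and a preset threshold, the frequency ordinal of a first character to be encoded in the first character set being less than or equal to the preset threshold, and the frequency ordinal of a second character to be encoded in the second character set being greater than the preset threshold;
for the first character set, the delimiter includes a binary first prefix symbol and a binary end symbol with opposite values, and the number of bits of the first prefix symbol equals the value of the frequency ordinal minus 1;
for the second character set, the delimiter includes a binary second prefix symbol and a binary end symbol with opposite values, and the second prefix symbol has at least 1 bit more than the longest first prefix symbol.
Clause A7. The method according to clause A6, wherein
the binary code of the second character to be encoded further includes an intermediate symbol determined from the binary number of the corresponding frequency ordinal.
Clause A8. The method according to clause A7, wherein
if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1;
if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded.
Clause A9. The method according to clause A6, wherein the end symbol is a single bit 1.
Clause A10. The method according to any one of clauses A1 to A3, wherein binary-encoding the at least one character to be encoded in the first data code to obtain the second data code comprises:
determining the characters to be encoded at preset bit positions in the first data code, the preset bit positions being higher than the bit positions of the other characters to be encoded in the first data code;
binary-encoding the characters to be encoded at the preset bit positions to obtain the second data code.
Clause A11. The method according to any one of clauses A1 to A3, wherein obtaining the first data code according to the data code and the average value of the data code comprises:
calculating the difference between the data code and the average value of the data code;
obtaining the first data code based on the difference.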
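Clause A12. A data compression apparatus, comprising:
a first compression module, configured to determine a non-idle character string in a character string to be processed, the non-idle character string including a data code, the distribution of the data code conforming to a normal distribution;
a processing module, configured to obtain a first data code according to the data code and the average value of the data code, the first data code including at least one character to be encoded;
a second compression module, configured to binary-encode the at least one character to be encoded in the first data code to obtain a second data code;
an obtaining module, configured to obtain a compression result of the character string to be processed according to the characters in the character string to be processed other than the data code, the first data code, and the second data code.
Clause A13. The apparatus according to clause A12, wherein the first compression module is specifically configured to:
obtain the recorded positions of the non-idle numbers in the character string to be processed;
determine the non-idle character string in the character string to be processed based on the positions.
Clause A14. The apparatus according to clause A12, wherein the first compression module is specifically configured to:
determine the non-idle numbers in the character string to be processed and the column numbers of the non-idle numbers;
obtain the non-idle character string in the character string to be processed based on the non-idle numbers and the column numbers of the non-idle numbers.
Clause A15. The apparatus according to any one of clauses A12 to A14, wherein the second compression module is specifically configured to:
determine the occurrence frequency of each character to be encoded in the first data code;
binary-encode the characters to be encoded according to the occurrence frequencies to obtain the second data code, wherein the binary code of a character to be encoded with a higher occurrence frequency is shorter than the binary code of a character to be encoded with a lower occurrence frequency.
Clause A16. The apparatus according to clause A15, wherein the second data code includes at least a delimiter;
the second compression module is specifically configured to:
determine the frequency ordinal of each character to be encoded in descending order of the frequency with which the characters occur in the first data code, the frequency ordinals being positive integers assigned sequentially starting from 1;
binary-encode the characters to be encoded according to the frequency ordinals and the delimiter to obtain the second data code.
Clause A17. The apparatus according to clause A16, wherein
the first data code includes a first character set and a second character set divided according to the frequency ordinal and a preset threshold, the frequency ordinal of a first character to be encoded in the first character set being less than or equal to the preset threshold, and the frequency ordinal of a second character to be encoded in the second character set being greater than the preset threshold;
for the first character set, the delimiter includes a binary first prefix symbol and a binary end symbol with opposite values, and the number of bits of the first prefix symbol equals the value of the frequency ordinal minus 1;
for the second character set, the delimiter includes a binary second prefix symbol and a binary end symbol with opposite values, and the second prefix symbol has at least 1 bit more than the longest first prefix symbol.
Clause A18. The apparatus according to clause A17, wherein
the binary code of the second character to be encoded further includes an intermediate symbol determined from the binary number of the corresponding frequency ordinal.
Clause A19. The apparatus according to clause A18, wherein
if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1;
if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded.
Clause A20. The apparatus according to clause A17, wherein the end symbol is a single bit 1.
Clause A21. The apparatus according to any one of clauses A12 to A14, wherein the second compression module is specifically configured to:
determine the characters to be encoded at preset bit positions in the first data code, the preset bit positions being higher than the bit positions of the other characters to be encoded in the first data code;
binary-encode the characters to be encoded at the preset bit positions to obtain the second data code.
Clause A22. The apparatus according to any one of clauses A12 to A14, wherein the processing module is specifically configured to:
calculate the difference between the data code and the average value of the data code;
obtain the first data code based on the difference.
Clause A23. An electronic device, comprising: a processor, a memory, and computer program instructions stored in the memory and executable on the processor;
when executing the computer program instructions, the processor implements the data compression method according to any one of clauses A1 to A11.
Clause A24. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the data compression method according to any one of clauses A1 to A11.
Clause A25. A computer program product, comprising a computer program which, when executed by a processor, implements the data compression method according to any one of clauses A1 to A11.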
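Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.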
Claims (25)
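- A data compression method, characterized by comprising: determining a non-idle character string in a character string to be processed, the non-idle character string including a data code, the distribution of the data code conforming to a normal distribution; obtaining a first data code according to the data code and the average value of the data code, the first data code including at least one character to be encoded; binary-encoding the at least one character to be encoded in the first data code to obtain a second data code; obtaining a compression result of the character string to be processed according to the characters in the character string to be processed other than the data code, the first data code, and the second data code.
- The method according to claim 1, characterized in that determining the non-idle character string in the character string to be processed comprises: obtaining the recorded positions of the non-idle numbers in the character string to be processed; determining the non-idle character string in the character string to be processed based on the positions.
- The method according to claim 1, characterized in that determining the non-idle character string in the character string to be processed comprises: determining the non-idle numbers in the character string to be processed and the column numbers of the non-idle numbers; obtaining the non-idle character string in the character string to be processed based on the non-idle numbers and the column numbers of the non-idle numbers.
- The method according to any one of claims 1 to 3, characterized in that binary-encoding the at least one character to be encoded in the first data code to obtain the second data code comprises: determining the occurrence frequency of each character to be encoded in the first data code; binary-encoding the characters to be encoded according to the occurrence frequencies to obtain the second data code, wherein the binary code of a character to be encoded with a higher occurrence frequency is shorter than the binary code of a character to be encoded with a lower occurrence frequency.
- The method according to claim 4, characterized in that the second data code includes at least a delimiter; binary-encoding the characters to be encoded according to the occurrence frequencies to obtain the second data code comprises: determining the frequency ordinal of each character to be encoded in descending order of the frequency with which the characters occur in the first data code, the frequency ordinals being positive integers assigned sequentially starting from 1; binary-encoding the characters to be encoded according to the frequency ordinals and the delimiter to obtain the second data code.
- The method according to claim 5, characterized in that the first data code includes a first character set and a second character set divided according to the frequency ordinal and a preset threshold, the frequency ordinal of a first character to be encoded in the first character set being less than or equal to the preset threshold, and the frequency ordinal of a second character to be encoded in the second character set being greater than the preset threshold; for the first character set, the delimiter includes a binary first prefix symbol and a binary end symbol with opposite values, and the number of bits of the first prefix symbol equals the value of the frequency ordinal minus 1; for the second character set, the delimiter includes a binary second prefix symbol and a binary end symbol with opposite values, and the second prefix symbol has at least 1 bit more than the longest first prefix symbol.
- The method according to claim 6, characterized in that the binary code of the second character to be encoded further includes an intermediate symbol determined from the binary number of the corresponding frequency ordinal.
- The method according to claim 7, characterized in that if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1; if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded.
- The method according to claim 6, characterized in that the end symbol is a single bit 1.
- The method according to any one of claims 1 to 3, characterized in that binary-encoding the at least one character to be encoded in the first data code to obtain the second data code comprises: determining the characters to be encoded at preset bit positions in the first data code, the preset bit positions being higher than the bit positions of the other characters to be encoded in the first data code; binary-encoding the characters to be encoded at the preset bit positions to obtain the second data code.
- The method according to any one of claims 1 to 3, characterized in that obtaining the first data code according to the data code and the average value of the data code comprises: calculating the difference between the data code and the average value of the data code; obtaining the first data code based on the difference.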
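- A data compression apparatus, characterized by comprising: a first compression module, configured to determine a non-idle character string in a character string to be processed, the non-idle character string including a data code, the distribution of the data code conforming to a normal distribution; a processing module, configured to obtain a first data code according to the data code and the average value of the data code, the first data code including at least one character to be encoded; a second compression module, configured to binary-encode the at least one character to be encoded in the first data code to obtain a second data code; an obtaining module, configured to obtain a compression result of the character string to be processed according to the characters in the character string to be processed other than the data code, the first data code, and the second data code.
- The apparatus according to claim 12, characterized in that the first compression module is specifically configured to: obtain the recorded positions of the non-idle numbers in the character string to be processed; determine the non-idle character string in the character string to be processed based on the positions.
- The apparatus according to claim 12, characterized in that the first compression module is specifically configured to: determine the non-idle numbers in the character string to be processed and the column numbers of the non-idle numbers; obtain the non-idle character string in the character string to be processed based on the non-idle numbers and the column numbers of the non-idle numbers.
- The apparatus according to any one of claims 12 to 14, characterized in that the second compression module is specifically configured to: determine the occurrence frequency of each character to be encoded in the first data code; binary-encode the characters to be encoded according to the occurrence frequencies to obtain the second data code, wherein the binary code of a character to be encoded with a higher occurrence frequency is shorter than the binary code of a character to be encoded with a lower occurrence frequency.
- The apparatus according to claim 15, characterized in that the second data code includes at least a delimiter; the second compression module is specifically configured to: determine the frequency ordinal of each character to be encoded in descending order of the frequency with which the characters occur in the first data code, the frequency ordinals being positive integers assigned sequentially starting from 1; binary-encode the characters to be encoded according to the frequency ordinals and the delimiter to obtain the second data code.
- The apparatus according to claim 16, characterized in that the first data code includes a first character set and a second character set divided according to the frequency ordinal and a preset threshold, the frequency ordinal of a first character to be encoded in the first character set being less than or equal to the preset threshold, and the frequency ordinal of a second character to be encoded in the second character set being greater than the preset threshold; for the first character set, the delimiter includes a binary first prefix symbol and a binary end symbol with opposite values, and the number of bits of the first prefix symbol equals the value of the frequency ordinal minus 1; for the second character set, the delimiter includes a binary second prefix symbol and a binary end symbol with opposite values, and the second prefix symbol has at least 1 bit more than the longest first prefix symbol.
- The apparatus according to claim 17, characterized in that the binary code of the second character to be encoded further includes an intermediate symbol determined from the binary number of the corresponding frequency ordinal.
- The apparatus according to claim 18, characterized in that if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is less than or equal to the preset threshold, the number of bits of the intermediate symbol equals the preset threshold plus 1; if the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded is greater than the preset threshold, the number of bits of the intermediate symbol equals the number of binary digits of the frequency ordinal minus 1 of the second character to be encoded.
- The apparatus according to claim 17, characterized in that the end symbol is a single bit 1.
- The apparatus according to any one of claims 12 to 14, characterized in that the second compression module is specifically configured to: determine the characters to be encoded at preset bit positions in the first data code, the preset bit positions being higher than the bit positions of the other characters to be encoded in the first data code; binary-encode the characters to be encoded at the preset bit positions to obtain the second data code.
- The apparatus according to any one of claims 12 to 14, characterized in that the processing module is specifically configured to: calculate the difference between the data code and the average value of the data code; obtain the first data code based on the difference.
- An electronic device, characterized by comprising: a processor, a memory, and computer program instructions stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 11 when executing the computer program instructions.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method according to any one of claims 1 to 11.
- A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.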
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210617490.4A CN117200800A (zh) | 2022-06-01 | 2022-06-01 | Data compression method, apparatus, device, and storage medium |
CN202210617490.4 | 2022-06-01 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023231313A1 (zh) | 2023-12-07 |
Family
ID=88983846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/132677 WO2023231313A1 (zh) | 2022-06-01 | 2022-11-17 | Data compression method, apparatus, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117200800A (zh) |
WO (1) | WO2023231313A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
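CN102457283A (zh) * | 2010-10-28 | 2012-05-16 | 阿里巴巴集团控股有限公司 | Data compression and decompression method and device |
CN107592116A (zh) * | 2017-09-21 | 2018-01-16 | 咪咕文化科技有限公司 | Data compression method, apparatus, and storage medium |
CN112131865A (zh) * | 2020-09-11 | 2020-12-25 | 成都运达科技股份有限公司 | Digital compression processing method, apparatus, and storage medium for rail transit messages |
CN113542225A (zh) * | 2021-06-17 | 2021-10-22 | 深圳市合广测控技术有限公司 | Data compression method, apparatus, terminal device, and storage medium |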
Also Published As
Publication number | Publication date |
---|---|
CN117200800A (zh) | 2023-12-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22944615; Country of ref document: EP; Kind code of ref document: A1 |