WO2021109696A1 - Data compression, decompression, and processing method and device based on data compression and decompression - Google Patents

Data compression, decompression, and processing method and device based on data compression and decompression

Info

Publication number: WO2021109696A1 (PCT/CN2020/118674)
Authority: WIPO (PCT)
Prior art keywords: data, compression, compressed, compressed data, processing
Application number: PCT/CN2020/118674
Other languages: English (en), French (fr)
Inventors: 王骁 (Wang Xiao), 张楠赓 (Zhang Nangeng)
Original Assignee: 嘉楠明芯(北京)科技有限公司 (Canaan Bright Sight Co., Ltd., Beijing)
Application filed by 嘉楠明芯(北京)科技有限公司
Publication of WO2021109696A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059 Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • H03M7/3064 Segmenting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The invention belongs to the field of data compression and decompression, and specifically relates to data compression, decompression, and processing methods and devices based on data compression and decompression.
  • The embodiments of the present invention propose a data compression method, a data decompression method, and processing methods and devices based on data compression and decompression. With this method and device, the above-mentioned problems can be solved.
  • A data compression method, including: receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model; and performing compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.
  • The data format of the data to be compressed is fixed-point data or floating-point data.
  • The data to be compressed is high-order tensor data, and the compression processing performed includes: performing the compression processing on the data to be compressed in units of rows.
  • The compression processing executed includes a first compression processing based on a sparse compression algorithm, the first compression processing including: performing sparse compression on the data to be compressed in parallel and outputting first compressed data and first additional information, where the first compressed data is formed as closely packed non-zero data.
  • The first additional information includes a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.
  • The compression processing executed includes a second compression processing based on a bit-plane compression algorithm. The second compression processing includes: packetizing the data to be compressed or the first compressed data to obtain multiple data packets; performing bit-plane compression preprocessing on the multiple data packets respectively; distributing the preprocessed data packets to multiple encoders to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets; and combining the multiple BPC encoded packets to output second compressed data and second additional information.
  • The second additional information includes second row length information and sub-packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the sub-packet length information indicates the data length of each BPC encoded packet.
  • The method further includes: performing bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism.
  • Packetizing the data to be compressed or the first compressed data further includes: packetizing the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled.
  • Performing compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm further includes: in response to a first enable instruction, executing the first compression processing and the second compression processing in sequence and outputting the second compressed data, the first additional information, and the second additional information as compressed data; in response to a second enable instruction, executing the first compression processing alone and outputting the first compressed data and the first additional information as compressed data; and in response to a third enable instruction, executing the second compression processing alone and outputting the second compressed data and the second additional information as compressed data.
  • A data decompression method, including: receiving compressed data generated using the data compression method of any one of claims 1-10; and performing decompression processing on the compressed data using the inverse steps of the data compression method of any one of claims 1-8 to restore the compressed data.
  • A processing method based on data compression and decompression, including: performing compression processing on data to be compressed using the method of any one of claims 1-10 to obtain compressed data, and transmitting the compressed data to an external memory for storage; and acquiring the compressed data stored in the external memory, performing decompression processing on it using the method of claim 11 to restore it to sparse data, and inputting the restored sparse data into the neural network model to perform computation.
  • A data compression device, including: a receiving module for receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model; and a compression module for performing compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.
  • The data format of the data to be compressed is fixed-point data or floating-point data.
  • The data to be compressed is high-order tensor data, and the compression module is further configured to perform the compression processing on the data to be compressed in units of rows.
  • The compression module further includes a first compression unit based on a sparse compression algorithm, configured to: perform sparse compression on the data to be compressed in parallel and output the first compressed data and the first additional information, where the first compressed data is formed as closely packed non-zero data.
  • The first additional information includes a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.
  • The compression module further includes a second compression unit based on a bit-plane compression algorithm, configured to: packetize the data to be compressed or the first compressed data to obtain multiple data packets; perform bit-plane compression preprocessing on the multiple data packets respectively; distribute the preprocessed data packets to multiple encoders to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets; and combine the multiple BPC encoded packets to output the second compressed data and the second additional information.
  • The second additional information includes second row length information and sub-packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the sub-packet length information indicates the data length of each BPC encoded packet.
  • The second compression unit is further configured to: perform bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism.
  • The second compression unit is further configured to: packetize the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled.
  • The compression module is further configured to: in response to the first enable instruction, execute the first compression processing and the second compression processing in sequence and output the second compressed data, the first additional information, and the second additional information as compressed data; in response to the second enable instruction, execute the first compression processing alone and output the first compressed data and the first additional information as compressed data; and in response to the third enable instruction, execute the second compression processing alone and output the second compressed data and the second additional information as compressed data.
  • A data decompression device, including: an acquisition module for acquiring compressed data generated using the data compression method of the first aspect; and a decompression module for performing decompression processing on the compressed data using the inverse steps of the data compression method of any one of claims 1-8 to restore the compressed data.
  • A neural network processing device based on data compression and decompression, including a data compression device and a data decompression device.
  • The data compression device is configured to perform compression processing on the data to be compressed using the method of the first aspect to obtain compressed data, and to transmit the compressed data to an external memory for storage.
  • The data decompression device is configured to acquire the compressed data stored in the external memory and perform decompression processing on it using the method of the second aspect, thereby restoring the compressed data to sparse data and inputting the restored sparse data into the neural network model to perform computation.
  • The above at least one technical solution adopted in the embodiments of the present application can achieve the following beneficial effects: a higher compression rate can be achieved, which saves data transmission bandwidth and external memory storage space, improves memory access efficiency, and enhances the computing power of the chip.
  • Figure 1 is a schematic structural diagram of an illustrative neural network chip;
  • Figure 2 is a schematic flowchart of a data compression method according to an embodiment of the present invention;
  • Figure 3 is a schematic diagram of data compression according to an embodiment of the present invention;
  • Figure 4 is a schematic diagram of a first compression processing based on a sparse compression algorithm according to an embodiment of the present invention;
  • Figure 5 is a schematic flowchart of a data compression method according to another embodiment of the present invention;
  • Figure 6 is a schematic diagram of a second compression processing based on a bit-plane compression algorithm according to an embodiment of the present invention;
  • Figure 7 is a schematic diagram of a first bit-plane compression preprocessing according to an embodiment of the present invention;
  • Figure 8 is a schematic diagram of a second bit-plane compression preprocessing according to an embodiment of the present invention;
  • Figure 9 is a schematic diagram of a third bit-plane compression preprocessing according to an embodiment of the present invention;
  • Figure 10 is a schematic flowchart of a data decompression method according to an embodiment of the present invention;
  • Figure 11 is a schematic diagram of a data compression device according to an embodiment of the present invention;
  • Figure 12 is a schematic diagram of a data decompression device according to an embodiment of the present invention;
  • Figure 13 is a schematic structural diagram of a processing device based on data compression and decompression according to an embodiment of the present invention;
  • Figure 14 is a schematic structural diagram of a data compression or decompression device according to an embodiment of the present invention.
  • Figure 1 is a schematic structural diagram of an illustrative neural network chip 10.
  • The neural network processing unit 11 is mainly used for neural network computation, and may specifically include an arithmetic unit 12 and an internal memory 13.
  • The internal memory 13 usually uses SRAM (Static Random-Access Memory); because of its high cost, large-capacity internal memory is usually avoided in practical applications.
  • The neural network chip 10 further includes an external memory 14 electrically connected to the neural network processing unit 11, which usually uses relatively low-cost DRAM (Dynamic Random Access Memory), DDR SDRAM, or the like, to store larger volumes of data.
  • The neural network model can be deployed on the arithmetic unit 12 for data processing.
  • The neural network model includes multiple layers. In an actual neural network operation, the intermediate data output by each layer of the neural network model needs to be stored and reused in the operation of subsequent layers. However, since the storage space of the internal memory 13 of the neural network processing unit 11 is limited, the intermediate data output by the above layers usually needs to be stored in the external memory 14 and read back from the external memory 14 when needed later.
  • Figure 2 shows a schematic flowchart of a data compression method 20 according to an embodiment of the present invention.
  • The method 20 includes:
  • Step 21: receive data to be compressed, where the data to be compressed is sparse data output by any layer of the neural network model;
  • Step 22: perform compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm to obtain compressed data.
  • The data to be compressed may be the intermediate data output by any layer of the neural network model after image data is input to the model during a neural network operation.
  • The intermediate data can be feature map data of the image data.
  • Feature map data usually exhibits sparsity, such as a large number of zero values, or a large number of fixed values (e.g., all 0s or all 1s) after specific operations.
  • The data to be compressed can be compressed based on the sparse compression algorithm and the bit-plane compression algorithm, and the resulting compressed data is transmitted to and stored in the external memory, which saves data transmission bandwidth and external memory storage space.
  • The cost and power consumption of the chip can be reduced, memory access efficiency can be improved, and the computing power of the chip can be enhanced.
  • The sparse compression (SC) algorithm performs sparsity-based compression and is a lossless compression algorithm: it takes the non-zero values out of the data to be compressed and packs them closely together in order, while outputting a bitmap to indicate where the non-zero values are located. The Bit Plane Compression (BPC) algorithm is also a lossless compression algorithm; it includes at least BPC preprocessing and BPC encoding. BPC preprocessing increases the compressibility of the data through operations such as adjacent-number subtraction, matrix transposition, and data XOR; BPC encoding then compresses the BPC-preprocessed data according to the BPC encoding rules.
  • A two-stage compression process based on the SC algorithm and the BPC algorithm may be used on the data to be compressed.
  • If the data to be compressed is tensor data, each row of it can be fed in turn into the above first compression unit, which performs compression based on the SC algorithm and outputs the first compressed data (closely packed non-zero values), the bitmap, and the row length information of each row after the first compression processing.
  • The first compressed data can also be fed into the second compression unit, which executes the second compression processing based on the BPC algorithm and outputs the second compressed data, the row length information of each row after the second compression processing, and so on.
  • The compression processing can also be performed based on the SC algorithm alone or the BPC algorithm alone.
  • Each row of the data to be compressed can be fed in turn into the above first compression unit, and the first compressed data (closely packed non-zero values), the bitmap, and the row length information of each row after the first compression processing are output as the compressed data.
  • Alternatively, each row can be fed into the second compression unit, which executes the second compression processing based on the BPC algorithm and outputs the second compressed data, the row length information of each row after the second compression processing, and so on, as the compressed data.
  • The data format of the data to be compressed is fixed-point data or floating-point data.
  • The data to be compressed may consist of a number of fixed-point values or a number of floating-point values, for example, 16-bit floating-point numbers or 8-bit fixed-point numbers.
  • The input bit widths of the first compression unit and the second compression unit can be the same, e.g., 128 bits, so that 8 16-bit floating-point numbers can be input in parallel in each clock cycle to support parallel compression processing.
  • The data to be compressed is high-order tensor data, and the compression processing performed in step 22 includes: performing the compression processing on the data to be compressed in units of rows.
  • The high-order tensor data specifically refers to the feature map output by each network layer during a neural network operation, which can take the form of a second-, third-, or fourth-order tensor.
  • The feature map can be a three-dimensional tensor with the dimensions channel count, row count, and row width; its size can be expressed as c (channels) * h (rows) * w (row width).
  • All the data in the feature map to be compressed is compressed sequentially in units of rows, row by row and channel by channel.
  • The feature map can also be a four-dimensional tensor with the dimensions frame count, channel count, row count, and row width; its size can be expressed as n (frames) * c (channels) * h (rows) * w (row width). When performing compression processing, all the data in the feature map to be compressed is compressed in units of rows, row by row, channel by channel, and frame by frame, as sketched below.
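To make the traversal order concrete, here is a minimal Python sketch (not part of the patent) that yields the rows of a 3-D (c, h, w) or 4-D (n, c, h, w) feature map in the row-by-row, channel-by-channel, frame-by-frame order described above; the function name and the use of numpy are illustrative assumptions.

```python
import numpy as np

def rows_in_compression_order(feature_map: np.ndarray):
    """Yield the rows of a feature map in the order used for compression:
    row by row within a channel, channel by channel, frame by frame."""
    if feature_map.ndim == 3:
        feature_map = feature_map[np.newaxis, ...]  # treat a 3-D map as one frame
    n, c, h, w = feature_map.shape
    for frame in range(n):
        for channel in range(c):
            for row in range(h):
                yield feature_map[frame, channel, row, :]  # one row of width w

# A 2-channel, 3-row, 4-column feature map yields 2 * 3 = 6 rows.
fm = np.zeros((2, 3, 4), dtype=np.float16)
assert sum(1 for _ in rows_in_compression_order(fm)) == 6
```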
  • The compression processing executed includes a first compression processing based on the sparse compression algorithm, the first compression processing including: performing sparse compression on the data to be compressed in parallel and outputting the first compressed data and the first additional information, where the first compressed data may be formed as closely packed non-zero data.
  • The first additional information may include a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.
  • When the data to be compressed is compressed in units of rows, assume the current row to be compressed is d0 to d31 in Figure 4 and the format of each value is a 16-bit floating-point number (bf16 for short). Eight values can therefore be input in parallel to the first compression unit in Figure 3 in each clock cycle: d0 to d7 in the first clock cycle, d8 to d15 in the second, and so on until the whole current row d0 to d31 has been sent, after which the input-done signal is raised to tell the first compression unit that data input is complete.
  • The first compression unit can sparsely compress 8 values in parallel and pick out the non-zero values.
  • In the bitmap, each bit indicates whether the data to be compressed is zero at the corresponding position; for example, in Figure 4, d7 is non-zero, so the bitmap bit corresponding to d7 is set to 1, while d6 is zero, so the bitmap bit corresponding to d6 is set to 0.
  • The above closely packed non-zero values are output as the first compressed data, together with the bitmap and the first row length information, where the first row length information indicates the data length of each row of the data to be compressed after the first compression processing, i.e., the total number of bits of the first compressed data for the current row.
  • The output bit width of the first compression processing can be kept consistent with the input bit width, e.g., both 128 bits, i.e., 8 16-bit floating-point numbers, or 16 8-bit fixed-point numbers, can be input or output in parallel.
  • The output bit width of the bitmap can be kept consistent with the bus width, e.g., 64 bits. A minimal sketch of this first compression processing follows.
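The following is a minimal sketch of the first (SC) compression of one row, matching the Figure 4 example; it assumes each value occupies 16 bits and models the packed non-zero output, bitmap, and first row length information as plain Python lists. The function name is illustrative, not from the patent.

```python
def sc_compress_row(row, bits_per_value=16):
    """First compression processing (sketch): pick out non-zero values,
    pack them closely, and emit the bitmap and row length information."""
    non_zero = [v for v in row if v != 0]        # closely packed non-zero data
    bitmap = [0 if v == 0 else 1 for v in row]   # 1 = non-zero at that position
    row_length_bits = len(non_zero) * bits_per_value
    return non_zero, bitmap, row_length_bits

# A 32-value row with 12 non-zero values -> 12 * 16 = 192-bit payload,
# plus a 32-bit bitmap, as in the Figure 4 example.
row = [0.0] * 20 + [1.5] * 12
values, bitmap, length = sc_compress_row(row)
assert (len(values), len(bitmap), length) == (12, 32, 192)
```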
  • The compression processing executed includes a second compression processing based on the bit-plane compression algorithm.
  • The second compression processing includes:
  • Step 51: packetize the data to be compressed or the first compressed data to obtain multiple data packets;
  • Step 52: perform bit-plane compression preprocessing on the multiple data packets respectively;
  • Step 53: distribute the preprocessed data packets to a multi-channel encoder to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets;
  • Step 54: combine the multiple BPC encoded packets to output second compressed data and second additional information.
  • The object of the second compression processing may be the data to be compressed, or the first compressed data output after the first compression processing; the embodiments of the present application place no specific limitation on this.
  • The input bit width of the second compression processing can be kept consistent with the output bit width of the first compression processing, e.g., both 128 bits.
  • The second compression processing can likewise accept fixed-point or floating-point input, supporting formats such as 16-bit floating-point or 8-bit fixed-point numbers. For example, 8 16-bit floating-point numbers, or 16 8-bit fixed-point numbers, may be input in parallel to the second compression unit shown in Figure 3.
  • Step 51 may further include: packetizing the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled.
  • For example, the data set d0 to d98 can be divided into groups of 16 values, giving 6 data packets, such as package0: d0~d15, package1: d16~d31, and so on; a sketch of this packetizing step follows.
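A minimal sketch of step 51, assuming the preset length of 16 values from the example; the pad_last flag selects between the two handling options for a short last packet (zero-filling versus leaving it uncompressed):

```python
def packetize(data, packet_len=16, pad_last=True):
    """Step 51 (sketch): split a row into fixed-length data packets."""
    packets = [list(data[i:i + packet_len]) for i in range(0, len(data), packet_len)]
    if packets and len(packets[-1]) < packet_len and pad_last:
        packets[-1] += [0] * (packet_len - len(packets[-1]))  # zero-fill the tail
    return packets

# d0..d98 (99 values) -> 6 packets; package5 holds d96..d98 plus 13 zeros.
packets = packetize(range(99))
assert len(packets) == 6 and len(packets[-1]) == 16
```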
  • The bit-plane compression preprocessing may generally include a first BPC preprocessing that performs adjacent-number subtraction, a second BPC preprocessing that performs matrix transposition, and a third BPC preprocessing that performs adjacent XOR, thereby increasing the compressibility of each data packet.
  • Step 52 may further include: performing the bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism.
  • Since BPC preprocessing is fast while BPC encoding is slow, a multiplexing mechanism can be adopted: the multiple encoders share one device that performs the BPC preprocessing, and after the BPC preprocessing is complete, the preprocessed data packets are distributed to the multiple encoders in order.
  • The second compression processing based on the BPC algorithm is a purely serial encoding process in which later data depends on earlier data, and it can hardly meet the rate requirements of the interaction between the chip and the external memory; multiple encoders can therefore be used in parallel.
  • Parallel processing here means that the data is packetized and then distributed to different encoders for parallel BPC encoding, and the BPC encoded data is finally combined and output, which meets the high-rate requirement. Additional packetization control logic therefore needs to be added to avoid data confusion, and the sub-packet length information, which indicates the number of bits of each sub-packet after BPC encoding, needs to be known.
  • Assume the data to be compressed in the current row is formed as d0 to d98 (i.e., data0 to data98) in Figure 6, or that after the first compression processing of the current row, the output first compressed data is formed as d0 to d98 (i.e., data0 to data98) in Figure 6. Performing the second compression processing on d0 to d98 may then specifically include:
  • First, step 51 can be executed to divide d0 to d98 into groups of 16 values, giving 6 data packets: package0: d0~d15, package1: d16~d31, ..., package5: d96~d98.
  • Next, step 52 can be executed to perform BPC preprocessing on the above 6 data packets in turn.
  • The BPC preprocessing may specifically include the first BPC preprocessing, the second BPC preprocessing, and the third BPC preprocessing.
  • Figures 7-9 respectively show the data processing of these three BPC preprocessing steps.
  • The first BPC preprocessing includes: selecting the first value data0 of the packet as the base, and subtracting each remaining value from its neighbor in turn using delta_n = data_n - data_{n-1}, giving (delta_1, ..., delta_15), where n is a positive integer from 1 to 15.
  • To avoid overflow, subtracting two 16-bit values yields a 17-bit result, giving a 16-bit base and 15 17-bit differences (delta_1, ..., delta_15).
  • The second BPC preprocessing includes: treating (delta_1, ..., delta_15) as a 17-bit * 15 data matrix and transposing it to obtain a new 15-bit * 17 data block, whose 17 15-bit words are defined in turn as (DBP_0, ..., DBP_16), giving a 16-bit base and 17 15-bit DBP words. The third BPC preprocessing then XORs each DBP word with its neighbor to obtain the DBX words, with the last word kept as-is (DBX_16 = DBP_16); the three steps are sketched below.
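The three preprocessing steps can be sketched as follows for one 16-value packet, treating each 16-bit value as an unsigned integer word. The bit-plane ordering (DBP_0 as the most significant plane) and the two's-complement masking of the 17-bit deltas are assumptions, since the text does not fix them:

```python
def bpc_preprocess(packet):
    """BPC preprocessing (sketch): adjacent subtraction, bit-plane
    transposition, and adjacent XOR for one 16-value packet."""
    base = packet[0]                                     # 16-bit base
    # First preprocessing: delta_n = data_n - data_{n-1}, kept in 17 bits.
    deltas = [(packet[n] - packet[n - 1]) & 0x1FFFF for n in range(1, 16)]
    # Second preprocessing: transpose the 15 x 17-bit matrix into 17 x 15-bit
    # bit-plane words DBP_0..DBP_16 (here DBP_0 is the most significant plane).
    dbp = []
    for plane in range(16, -1, -1):
        word = 0
        for d in deltas:                                 # one bit per delta
            word = (word << 1) | ((d >> plane) & 1)
        dbp.append(word)
    # Third preprocessing: DBX_n = DBP_n XOR DBP_{n+1}; DBX_16 = DBP_16.
    dbx = [dbp[n] ^ dbp[n + 1] for n in range(16)] + [dbp[16]]
    return base, dbp, dbx

base, dbp, dbx = bpc_preprocess(list(range(100, 116)))
assert len(dbp) == 17 and len(dbx) == 17                 # 1 base + 17 DBX words
```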
  • Step 53 can then be executed to distribute the 6 BPC-preprocessed data packets to 6 channels of the multi-channel encoder (16 channels in the figure) for parallel encoding.
  • The preprocessed package0 to package5 can be sent to encoders 0 to 5 respectively; the 6 encoders execute the BPC encoding based on the BPC encoding rules in parallel, and 6 BPC encoded packets are finally output in parallel.
  • Each encoder encodes its data packet according to the BPC encoding rule table shown in Table 1.
  • After BPC preprocessing, each data packet contains 18 data codes (1 base + 17 DBX), so BPC-encoding each preprocessed data packet requires 18 clock cycles.
  • This embodiment can use parallel multi-channel encoders to perform parallel encoding processing, thereby obtaining a higher processing speed.
  • Step 54 can then be executed.
  • The 6 BPC encoded packets output in parallel are merged into serial data according to the original packetization logic, and the second compressed data (the serial data obtained after merging) and the second additional information are output.
  • The second additional information in step 54 may include second row length information and sub-packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the sub-packet length information indicates the data length of each BPC encoded packet, which facilitates parallel processing during decompression. The fan-out/merge is sketched below.
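A minimal software sketch of the fan-out/merge of steps 53-54, using a thread pool to stand in for the parallel hardware encoders; encode_fn is an assumed stand-in for one BPC encoder, and the per-packet lengths model the sub-packet length information:

```python
from concurrent.futures import ThreadPoolExecutor

def encode_packets_parallel(packets, encode_fn, n_encoders=6):
    """Steps 53-54 (sketch): encode packets in parallel, then merge the
    BPC encoded packets back into one serial stream in their original
    order, recording each packet's encoded bit length."""
    with ThreadPoolExecutor(max_workers=n_encoders) as pool:
        encoded = list(pool.map(encode_fn, packets))  # order is preserved
    sub_packet_lengths = [len(bits) for bits in encoded]
    second_compressed = ''.join(encoded)              # combined serial data
    return second_compressed, sub_packet_lengths
```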
  • Step 22 may specifically include: in response to the first enable instruction, executing the first compression processing and the second compression processing in sequence and outputting the second compressed data, the first additional information, and the second additional information as the compressed data. In some other implementations, step 22 may further include: in response to the second enable instruction, executing the first compression processing alone and outputting the first compressed data and the first additional information as the compressed data; or, in response to the third enable instruction, executing the second compression processing alone and outputting the second compressed data and the second additional information as the compressed data. A small dispatcher sketch of these three modes follows.
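The three enable modes can be pictured with the dispatcher below; sc_compress_row is the earlier sketch, while bpc_compress is a hypothetical helper standing in for the whole second compression processing (steps 51-54):

```python
def compress_row(row, enable, bpc_compress):
    """Dispatch on the enable instruction (sketch). bpc_compress(data)
    is assumed to return (second_compressed, second_additional_info)."""
    if enable == "SC+BPC":          # first enable instruction
        values, bitmap, len1 = sc_compress_row(row)
        data2, info2 = bpc_compress(values)
        return data2, (bitmap, len1), info2
    if enable == "SC":              # second enable instruction
        values, bitmap, len1 = sc_compress_row(row)
        return values, (bitmap, len1)
    if enable == "BPC":             # third enable instruction
        return bpc_compress(row)
    raise ValueError("unknown enable instruction")
```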
  • Table 2 takes the YoloV2-Relu network as an example, using the average values obtained from 50 randomly selected input images.
  • It lists the compression rates for each network layer under three different compression schemes: executing the first compression processing (SC encoding) alone, executing the second compression processing (BPC encoding) alone, and executing the first compression processing (SC encoding) and the second compression processing (BPC encoding) in sequence.
  • When both compression algorithms are used, the average compression rate can reach 33%, which means that data transmission can be reduced by about 70%; this not only shortens data transmission time but also saves bandwidth resources in the interaction with the external memory, which can then be allocated more reasonably to other units in the neural network processing device, greatly improving device performance.
  • The parallelized design guarantees the compression speed, and the compression rates of the upstream and downstream modules match.
  • An embodiment of the present invention also provides a data decompression method 100, as shown in Figure 10, including:
  • Step 101: receive compressed data, where the compressed data is generated using the data compression method shown in the foregoing embodiments;
  • Step 102: perform decompression processing on the compressed data using the inverse steps of the data compression method shown in the foregoing embodiments, so as to restore the compressed data.
  • The decompression processing and the compression processing are mutually inverse processes.
  • The data decompression in this embodiment uses a process completely inverse to every aspect of the above data compression method embodiments and obtains the corresponding technical effects, which will not be repeated here; the inverse of the first compression processing is sketched below.
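As one concrete piece of that inverse process, the first compression processing can be undone by walking the bitmap and re-inserting the packed non-zero values, as in this sketch (sc_compress_row is the earlier sketch):

```python
def sc_decompress_row(non_zero_values, bitmap):
    """Inverse of the first compression processing (sketch): restore each
    row by placing the packed non-zero values where the bitmap has a 1."""
    it = iter(non_zero_values)
    return [next(it) if bit else 0 for bit in bitmap]

row = [0, 0, 7, 0, 3]
values, bitmap, _ = sc_compress_row(row)
assert sc_decompress_row(values, bitmap) == row
```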
  • The embodiment of the present invention also provides a processing method based on data compression and decompression, including: compressing the data to be compressed and storing the result in an external memory, then reading it back, decompressing it, and feeding the restored sparse data into the neural network model, as detailed in the embodiments below.
  • An embodiment of the present invention also provides a data compression device 110.
  • The data compression device 110 includes:
  • a receiving module 111, configured to receive data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model;
  • a compression module 112, configured to perform compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm to obtain compressed data.
  • The data format of the data to be compressed is fixed-point data or floating-point data.
  • The data to be compressed is high-order tensor data, and the compression module is further configured to perform the compression processing on the data to be compressed in units of rows.
  • The compression module 112 further includes a first compression unit based on the sparse compression algorithm, configured to: perform sparse compression on the data to be compressed in parallel and output the first compressed data and the first additional information, where the first compressed data is formed as closely packed non-zero data.
  • The first additional information includes a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.
  • The compression module 112 further includes a second compression unit based on the bit-plane compression algorithm, configured to: packetize the data to be compressed or the first compressed data to obtain multiple data packets; perform bit-plane compression preprocessing on the multiple data packets respectively; distribute the preprocessed data packets to multiple encoders to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets; and combine the multiple BPC encoded packets to output the second compressed data and the second additional information.
  • The second additional information includes second row length information and sub-packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the sub-packet length information indicates the data length of each BPC encoded packet.
  • The second compression unit is further configured to: perform bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism.
  • The second compression unit is further configured to: packetize the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled.
  • The compression module 112 is further configured to: in response to the first enable instruction, execute the first compression processing and the second compression processing in sequence and output the second compressed data, the first additional information, and the second additional information as compressed data; in response to the second enable instruction, execute the first compression processing alone and output the first compressed data and the first additional information as compressed data; and in response to the third enable instruction, execute the second compression processing alone and output the second compressed data and the second additional information as compressed data.
  • An embodiment of the present invention also provides a data decompression device 120.
  • The data decompression device 120 includes:
  • an acquisition module 121, configured to acquire compressed data, the compressed data being generated using the data compression method shown in the foregoing embodiments;
  • a decompression module 122, configured to perform decompression processing on the compressed data using the inverse steps of the data compression method shown in the foregoing embodiments, so as to restore the compressed data.
  • An embodiment of the present invention also provides a processing device 130 based on data compression and decompression, as shown in Figure 13, including:
  • a data compression device 131, configured to obtain the sparse data output by any layer of the neural network model from the arithmetic unit 12 as the data to be compressed, perform compression processing on it using the data compression method shown in the foregoing embodiments to obtain compressed data, and transmit the compressed data to the external memory 14 for storage;
  • a data decompression device 132, configured to obtain the compressed data stored in the external memory 14 and perform decompression processing on it using the data decompression method shown in the foregoing embodiments, thereby restoring the compressed data to sparse data, and to input the restored sparse data into the neural network model in the arithmetic unit 12 to perform computation.
  • In this way, data transmission bandwidth and external memory storage space can be significantly saved, and memory access efficiency can be improved.
  • The cost and power consumption of the chip can be reduced, and the computing power of the processing device can be improved.
  • Figure 14 is a schematic diagram of a data compression or decompression device according to an embodiment of the present application, which is used to execute the data compression method shown in Figure 2 or the data decompression method shown in Figure 10.
  • The device includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform: receiving data to be compressed, the data to be compressed being sparse data output by any layer of the neural network model, and performing compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm to obtain compressed data; or to enable the at least one processor to perform: receiving compressed data generated using the data compression method shown in the foregoing embodiments, and performing decompression processing on the compressed data using the inverse steps of that data compression method to restore the compressed data.
  • An embodiment of the present application also provides a computer-readable storage medium storing a program; when the program is executed by a multi-core processor, the multi-core processor is caused to execute: receiving data to be compressed, the data to be compressed being sparse data output by any layer of the neural network model, and performing compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm to obtain compressed data; or the multi-core processor is caused to execute: receiving compressed data generated using the data compression method shown in the foregoing embodiments, and performing decompression processing on the compressed data using the inverse steps of that data compression method to restore the compressed data.
  • The devices and the computer-readable storage medium provided in the embodiments of the present application correspond one-to-one to the methods; therefore, they have beneficial technical effects similar to those of the corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices, equipment, and computer-readable storage medium are not repeated here.
  • The embodiments of the present invention can be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • The computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which can be used to store information that can be accessed by computing devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Data compression, decompression, and processing methods and devices based on data compression and decompression. The data compression method includes: receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model (step 21); and performing compression coding on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data (step 22). With this method, a higher compression rate can be achieved, which in turn saves data transmission bandwidth and external memory storage space, improves memory access efficiency, and enhances the computing power of the chip.

Description

Data compression, decompression, and processing method and device based on data compression and decompression

Technical Field

The present invention belongs to the field of data compression and decompression, and specifically relates to data compression, decompression, and processing methods and devices based on data compression and decompression.
Background Art

This section is intended to provide a background or context for the embodiments of the invention set forth in the claims. The description here is not admitted to be prior art merely by being included in this section.

With the rapid development of neural networks, the recognition accuracy required of neural network models keeps rising, and the size of neural networks keeps growing. Limited by on-chip storage resources, existing chips can hardly load all the parameters of an entire network into on-chip storage at once; the usual practice is layered, block-wise computation to work around the shortage of on-chip storage, but this increases on-chip/off-chip data exchange, and memory access is often the main bottleneck limiting an accelerator's computing power. How to improve memory access efficiency is therefore the key problem in improving an accelerator's computing power.
Summary of the Invention

In view of the above problem that the prior art can hardly improve on-chip/off-chip memory access efficiency, the embodiments of the present invention propose data compression, decompression, and processing methods and devices based on data compression and decompression. With this method and device, the above problem can be solved.

The embodiments of the present invention provide the following solutions.
In a first aspect, a data compression method is provided, including: receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model; and performing compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.

In some possible implementations, the data format of the data to be compressed is fixed-point data or floating-point data.

In some possible implementations, the data to be compressed is high-order tensor data, and the compression processing performed includes: performing the compression processing on the data to be compressed in units of rows.

In some possible implementations, the compression processing performed includes a first compression processing based on the sparse compression algorithm, the first compression processing including: performing sparse compression on the data to be compressed in parallel and outputting first compressed data and first additional information, where the first compressed data is formed as closely packed non-zero data.

In some possible implementations, the first additional information includes a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.

In some possible implementations, the compression processing performed includes a second compression processing based on the bit-plane compression algorithm, the second compression processing including: packetizing the data to be compressed or the first compressed data to obtain multiple data packets; performing bit-plane compression preprocessing on the multiple data packets respectively; distributing the preprocessed data packets to multiple encoders to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets; and combining the multiple BPC encoded packets to output second compressed data and second additional information.

In some possible implementations, the second additional information includes second row length information and sub-packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the sub-packet length information indicates the data length of each BPC encoded packet.

In some possible implementations, the method further includes: performing bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism.

In some possible implementations, packetizing the data to be compressed or the first compressed data further includes: packetizing the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled.

In some possible implementations, performing compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm further includes: in response to a first enable instruction, executing the first compression processing and the second compression processing in sequence and outputting the second compressed data, the first additional information, and the second additional information as the compressed data; in response to a second enable instruction, executing the first compression processing alone and outputting the first compressed data and the first additional information as the compressed data; and in response to a third enable instruction, executing the second compression processing alone and outputting the second compressed data and the second additional information as the compressed data.

In a second aspect, a data decompression method is provided, including: receiving compressed data generated using the data compression method of any one of claims 1-10; and performing decompression processing on the compressed data using the inverse steps of the data compression method of any one of claims 1-8 to restore the compressed data.

In a third aspect, a processing method based on data compression and decompression is provided, including: performing compression processing on data to be compressed using the method of any one of claims 1-10 to obtain compressed data, and transmitting the compressed data to an external memory for storage; and acquiring the compressed data stored in the external memory, performing decompression processing on it using the method of claim 11 to restore it to sparse data, and inputting the restored sparse data into the neural network model to perform computation.
In a fourth aspect, a data compression device is provided, including: a receiving module for receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model; and a compression module for performing compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.

In some possible implementations, the data format of the data to be compressed is fixed-point data or floating-point data.

In some possible implementations, the data to be compressed is high-order tensor data, and the compression module is further configured to: perform the compression processing on the data to be compressed in units of rows.

In some possible implementations, the compression module further includes a first compression unit based on the sparse compression algorithm, configured to: perform sparse compression on the data to be compressed in parallel and output first compressed data and first additional information, where the first compressed data is formed as closely packed non-zero data.

In some possible implementations, the first additional information includes a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.

In some possible implementations, the compression module further includes a second compression unit based on the bit-plane compression algorithm, configured to: packetize the data to be compressed or the first compressed data to obtain multiple data packets; perform bit-plane compression preprocessing on the multiple data packets respectively; distribute the preprocessed data packets to multiple encoders to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets; and combine the multiple BPC encoded packets to output second compressed data and second additional information.

In some possible implementations, the second additional information includes second row length information and sub-packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the sub-packet length information indicates the data length of each BPC encoded packet.

In some possible implementations, the second compression unit is further configured to: perform bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism.

In some possible implementations, the second compression unit is further configured to: packetize the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled.

In some possible implementations, the compression module is further configured to: in response to a first enable instruction, execute the first compression processing and the second compression processing in sequence and output the second compressed data, the first additional information, and the second additional information as the compressed data; in response to a second enable instruction, execute the first compression processing alone and output the first compressed data and the first additional information as the compressed data; and in response to a third enable instruction, execute the second compression processing alone and output the second compressed data and the second additional information as the compressed data.

In a fifth aspect, a data decompression device is provided, including: an acquisition module for acquiring compressed data generated using the data compression method of the first aspect; and a decompression module for performing decompression processing on the compressed data using the inverse steps of the data compression method of any one of claims 1-8 to restore the compressed data.

In a sixth aspect, a neural network processing device based on data compression and decompression is provided, including: a data compression device for performing compression processing on data to be compressed using the method of the first aspect to obtain compressed data, and transmitting the compressed data to an external memory for storage; and a data decompression device for acquiring the compressed data stored in the external memory and performing decompression processing on it using the method of the second aspect, thereby restoring the compressed data to sparse data, and inputting the restored sparse data into the neural network model to perform computation.

The above at least one technical solution adopted in the embodiments of the present application can achieve the following beneficial effects: by compressing the data to be compressed with a sparse compression algorithm and a bit-plane compression algorithm, a higher compression rate can be achieved, which in turn saves data transmission bandwidth and external memory storage space, improves memory access efficiency, and enhances the computing power of the chip.
It should be understood that the above description is only an overview of the technical solutions of the present invention, so that the technical means of the present invention can be understood more clearly and implemented according to the contents of the specification. To make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below by way of example.

Brief Description of the Drawings

By reading the following detailed description of the exemplary embodiments, those of ordinary skill in the art will understand the advantages and benefits described herein, as well as other advantages and benefits. The drawings are only for the purpose of illustrating the exemplary embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Figure 1 is a schematic structural diagram of an illustrative neural network chip;
Figure 2 is a schematic flowchart of a data compression method according to an embodiment of the present invention;
Figure 3 is a schematic diagram of data compression according to an embodiment of the present invention;
Figure 4 is a schematic diagram of a first compression processing based on a sparse compression algorithm according to an embodiment of the present invention;
Figure 5 is a schematic flowchart of a data compression method according to another embodiment of the present invention;
Figure 6 is a schematic diagram of a second compression processing based on a bit-plane compression algorithm according to an embodiment of the present invention;
Figure 7 is a schematic diagram of a first bit-plane compression preprocessing according to an embodiment of the present invention;
Figure 8 is a schematic diagram of a second bit-plane compression preprocessing according to an embodiment of the present invention;
Figure 9 is a schematic diagram of a third bit-plane compression preprocessing according to an embodiment of the present invention;
Figure 10 is a schematic flowchart of a data decompression method according to an embodiment of the present invention;
Figure 11 is a schematic diagram of a data compression device according to an embodiment of the present invention;
Figure 12 is a schematic diagram of a data decompression device according to an embodiment of the present invention;
Figure 13 is a schematic structural diagram of a processing device based on data compression and decompression according to an embodiment of the present invention;
Figure 14 is a schematic structural diagram of a data compression or decompression device according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.
Detailed Description of the Embodiments

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be conveyed completely to those skilled in the art.

In the present invention, it should be understood that terms such as "including" or "having" are intended to indicate the existence of the features, numbers, steps, actions, components, parts, or combinations thereof disclosed in this specification, and are not intended to exclude the possibility that one or more other features, numbers, steps, actions, components, parts, or combinations thereof exist.

It should also be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another. The present invention will be described in detail below with reference to the drawings and in conjunction with the embodiments.
Figure 1 is a schematic structural diagram of an illustrative neural network chip 10.

The neural network processing unit 11 is mainly used for neural network computation, and may specifically include an arithmetic unit 12 and an internal memory 13. The internal memory 13 usually uses SRAM (Static Random-Access Memory); because of its high cost, large-capacity internal memory is usually avoided in practical applications. The neural network chip 10 further includes an external memory 14 electrically connected to the neural network processing unit 11, which usually uses relatively low-cost DRAM (Dynamic Random Access Memory), DDR (Double Data Rate SDRAM), or the like, to store larger volumes of data.

It can be understood that the neural network model can be deployed on the arithmetic unit 12 for data processing. The neural network model includes multiple layers; in an actual neural network operation, the intermediate data output by each layer of the neural network model needs to be stored and reused in the operation of subsequent layers. Since the storage space of the internal memory 13 of the neural network processing unit 11 is limited, the intermediate data output by each layer usually needs to be stored in the external memory 14 and read back from the external memory 14 when needed later.
Figure 2 shows a schematic flowchart of a data compression method 20 according to an embodiment of the present invention.

As shown in Figure 2, the method 20 includes:

Step 21: receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model;

Step 22: performing compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.
The data to be compressed may be the intermediate data output by any layer of the neural network model after image data is input to the model during a neural network operation. Specifically, the intermediate data may be feature map data of the image data. Feature map data usually exhibits sparsity: it contains a large number of zero values, or takes a large number of fixed values (such as all 0s or all 1s) after specific operations. On this basis, if such sparse intermediate data needs to be transmitted to and stored in the external memory, it can be taken as the data to be compressed and compressed based on the sparse compression algorithm and the bit-plane compression algorithm, and the resulting compressed data is transmitted to and stored in the external memory. This saves data transmission bandwidth and external memory storage space, reduces the cost and power consumption of the chip, improves memory access efficiency, and in turn enhances the computing power of the chip.

Specifically, the sparse compression (SC) algorithm performs sparsity-based compression and is a lossless compression algorithm. Its principle is to take the non-zero values out of the data to be compressed and pack them closely together in order, while outputting a bitmap to indicate where the non-zero values are located in the data to be compressed. The Bit Plane Compression (BPC) algorithm is also a lossless compression algorithm; it includes at least BPC preprocessing and BPC encoding. BPC preprocessing increases the compressibility of the data through operations such as adjacent-number subtraction, matrix transposition, and data XOR; BPC encoding then compresses the BPC-preprocessed data according to the BPC encoding rules.

For example, as shown in Figure 3, a two-stage compression process based on the SC algorithm and the BPC algorithm can be applied to the data to be compressed. If the data to be compressed is tensor data, each row of it can be fed in turn into the first compression unit, which performs compression based on the SC algorithm and outputs the first compressed data (closely packed non-zero values), the bitmap, and the row length information of each row after the first compression processing. Further, the first compressed data can be fed into the second compression unit, which performs the second compression processing based on the BPC algorithm and outputs the second compressed data, the row length information of each row after the second compression processing, and so on.

Optionally, the compression can also be performed based on the SC algorithm alone or on the BPC algorithm alone. For example, each row of the data to be compressed can be fed in turn into the first compression unit, and the first compressed data (closely packed non-zero values), the bitmap, and the row length information of each row after the first compression processing are output as the compressed data. Alternatively, each row of the data to be compressed can be fed in turn into the second compression unit, which performs the second compression processing based on the BPC algorithm and outputs the second compressed data, the row length information of each row after the second compression processing, and so on, as the compressed data.
In some possible implementations, the data format of the data to be compressed is fixed-point data or floating-point data.

Specifically, the data to be compressed may consist of a number of fixed-point values or a number of floating-point values, for example, 16-bit floating-point numbers or 8-bit fixed-point numbers. As an example, assuming the data to be compressed consists of 16-bit floating-point numbers, the input bit widths of the first compression unit and the second compression unit shown in Figure 3 can be the same, e.g., 128 bits, so that 8 16-bit floating-point numbers can be input in parallel per clock cycle to support parallel compression processing.
In some possible implementations, the data to be compressed is high-order tensor data, and the compression processing performed in step 22 includes: performing the compression processing on the data to be compressed in units of rows.

Specifically, the high-order tensor data refers to the feature map output by each network layer during a neural network operation, which can take the form of a second-, third-, or fourth-order tensor. For example, the feature map can be a three-dimensional tensor with the dimensions channel count, row count, and row width, with a size expressed as c (channels) * h (rows) * w (row width); when performing compression, all the data in the feature map to be compressed is compressed sequentially in units of rows, row by row and channel by channel. As another example, the feature map can be a four-dimensional tensor with the dimensions frame count, channel count, row count, and row width, with a size expressed as n (frames) * c (channels) * h (rows) * w (row width); when performing compression, all the data in the feature map to be compressed is compressed sequentially in units of rows, row by row, channel by channel, and frame by frame.
In some possible implementations, the compression processing performed in step 22 includes a first compression processing based on the sparse compression algorithm, the first compression processing including: performing sparse compression on the data to be compressed in parallel and outputting first compressed data and first additional information, where the first compressed data may be formed as closely packed non-zero data.

In some possible implementations, the first additional information may include a bitmap and first row length information, where the bitmap indicates the positions of non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.

The first compression processing is described in detail below with reference to Figure 4:
As shown in Figure 4, when the data to be compressed is compressed in units of rows, assume the current row to be compressed is d0~d31 in Figure 4 and each value is formatted as a 16-bit floating-point number (bf16 for short). Eight values can therefore be input in parallel to the first compression unit of Figure 3 in each clock cycle: d0~d7 in the first clock cycle, d8~d15 in the second, and so on until the whole current row d0~d31 has been sent, after which the input-done signal is raised to tell the first compression unit that data input is complete. The first compression unit can sparsely compress 8 values in parallel, pick out the non-zero values, and store them closely packed in a buffer, outputting a valid signal whenever 8 non-zero values have accumulated or compression ends. At the same time, the first compression unit also stores the 8-bit bitmap into the buffer in parallel, outputting a valid signal whenever 64 bits have accumulated or compression ends. In the bitmap, each bit indicates whether the data to be compressed is zero at the corresponding position; in Figure 4, d7 is non-zero, so the bitmap bit corresponding to d7 is set to 1, while d6 is zero, so the bitmap bit corresponding to d6 is set to 0. After d0~d31 have all been compressed, the closely packed non-zero values are output as the first compressed data, together with the bitmap and the first row length information, where the first row length information indicates the data length of each row of the data to be compressed after the first compression processing, i.e., the total number of bits of the first compressed data for the current row. In Figure 4, the first row length information shown is 12 (non-zero values) * 16 bit = 192 bit, and the row length information itself occupies 16 bits. It can thus be seen that the original size of the current row is 32 * 16 bit = 512 bit, containing 12 non-zero values and 20 zero values; after the first compression processing based on the SC algorithm, the size becomes 12 * 16 bit (first compressed data for the current row) + 32 * 1 bit (bitmap for the current row) + 16 bit (first row length information for the current row) = 240 bit, i.e., about 47% of the original size. The more zero values the original data contains, the lower the compression rate and the better the compression effect.

Optionally, considering factors such as pipeline rate matching and chip bus width, the output bit width of the first compression processing can be kept consistent with the input bit width, e.g., both 128 bits, i.e., 8 16-bit floating-point numbers, or 16 8-bit fixed-point numbers, can be input or output in parallel. Optionally, the output bit width of the bitmap can be kept consistent with the bus width, e.g., 64 bits.
In some possible implementations, the compression processing performed in step 22 includes a second compression processing based on the bit-plane compression algorithm.

As shown in Figure 5, the second compression processing includes:

Step 51: packetizing the data to be compressed or the first compressed data to obtain multiple data packets;

Step 52: performing bit-plane compression preprocessing on the multiple data packets respectively;

Step 53: distributing the preprocessed data packets to a multi-channel encoder to perform bit-plane compression processing in parallel, obtaining multiple BPC encoded packets;

Step 54: combining the multiple BPC encoded packets to output second compressed data and second additional information.
Specifically, in step 51, the object of the second compression processing may be the data to be compressed, or the first compressed data output after the first compression processing; the embodiments of the present application place no specific limitation on this. In addition, considering factors such as pipeline rate matching and chip bus width, the input bit width of the second compression processing can be kept consistent with the output bit width of the first compression processing, e.g., both 128 bits. The second compression processing can likewise accept fixed-point or floating-point input, supporting formats such as 16-bit floating-point or 8-bit fixed-point numbers; for example, 8 16-bit floating-point numbers, or 16 8-bit fixed-point numbers, can be input in parallel to the second compression unit shown in Figure 3.

In some possible implementations, step 51 may further include: packetizing the data to be compressed or the first compressed data according to a preset length, where, if the last of the multiple data packets is shorter than the preset length, the last packet is either left uncompressed or zero-filled. For example, as shown in Figure 6, the data set d0~d98 can be divided into groups of 16 values, giving 6 data packets, e.g., package0: d0~d15, package1: d16~d31, and so on. Since package5: d96~d98 has fewer than 16 points, it can be left uncompressed or zero-filled to avoid errors; zero-filling is simpler to design, while leaving it uncompressed introduces no invalid data and works better.
Specifically, in step 52, the bit-plane compression preprocessing can generally include a first BPC preprocessing that performs adjacent-number subtraction, a second BPC preprocessing that performs matrix transposition, and a third BPC preprocessing that performs adjacent XOR, thereby increasing the compressibility of each data packet.

In some possible implementations, step 52 may further include: performing the bit-plane compression preprocessing on the multiple data packets respectively by means of a multiplexing mechanism. Specifically, since BPC preprocessing is fast while BPC encoding is slow, a multiplexing mechanism can be adopted, i.e., the multiple encoders share one device that performs the BPC preprocessing, and after the BPC preprocessing is complete, the preprocessed data packets are distributed to the multiple encoders in order.

Specifically, in step 53, since the second compression processing based on the BPC algorithm is a purely serial encoding process in which later data depends on earlier data, it can hardly meet the rate requirements of the interaction between the chip and the external memory. Multiple encoders can therefore be used in parallel: the data is packetized and then distributed to different encoders for parallel BPC encoding, and the BPC encoded data is finally merged and output, which meets the high-rate requirement. This requires additional packetization control logic to avoid data confusion, and the sub-packet length information, which indicates the number of bits of each sub-packet after BPC encoding, must be known.
Steps 51 to 54 are described in detail below with reference to Figure 6:

As shown in Figure 6, assume the data to be compressed in the current row is formed as d0~d98 (i.e., data0~data98) in Figure 6, or that after the first compression processing of the current row, the output first compressed data is formed as d0~d98 (i.e., data0~data98) in Figure 6. Performing the second compression processing on d0~d98 may then specifically include:

First, step 51 can be executed to divide d0~d98 into groups of 16 values, giving 6 data packets: package0: d0~d15, package1: d16~d31, ..., package5: d96~d98.

Next, step 52 can be executed to perform BPC preprocessing on the above 6 data packets in turn. The BPC preprocessing may specifically include the first BPC preprocessing, the second BPC preprocessing, and the third BPC preprocessing; Figures 7-9 respectively show the data processing of these three BPC preprocessing steps.

The BPC preprocessing is elaborated below with reference to Figures 7-9, taking package0: d0~d15 (i.e., data0~data15) as an example:
As shown in Figure 7, the first BPC preprocessing includes: selecting the first value data0 of package0 as the base, and subtracting each remaining value from its neighbor in turn using the formula delta_n = data_n - data_{n-1}, giving (delta_1, ..., delta_15), where n is a positive integer from 1 to 15. To guarantee no overflow, subtracting two 16-bit values yields a 17-bit result, so a 16-bit base and 15 17-bit differences (delta_1, ..., delta_15) are obtained.

As shown in Figure 8, the second BPC preprocessing includes: treating (delta_1, ..., delta_15) as a 17-bit * 15 data matrix and transposing it to obtain a new 15-bit * 17 data block, whose 17 15-bit words are defined in turn as (DBP_0, ..., DBP_16), giving a 16-bit base and 17 15-bit DBP words.

As shown in Figure 9, the third BPC preprocessing includes: XORing each DBP word with its neighbor in turn to obtain the DBX data, i.e., DBP_0 XOR DBP_1 gives DBX_0, ..., DBP_15 XOR DBP_16 gives DBX_15; since DBP_16 is the last word and has no neighbor to XOR with, DBX_16 = DBP_16. After this step, a 16-bit base and 17 15-bit DBX words (DBX_0, ..., DBX_16) are obtained, which completes the BPC preprocessing of package0: (data_0, ..., data_15).

Next, step 53 can be executed to distribute the 6 BPC-preprocessed data packets to 6 channels of the multi-channel encoder (16 channels in the figure) for parallel encoding. As shown in Figure 6, the preprocessed package0 to package5 can be sent to encoders 0 to 5 respectively; the 6 encoders execute the BPC encoding based on the BPC encoding rules in parallel, and finally 6 BPC encoded packets are output in parallel.
Table 1: BPC encoding rule table

BASE/DBX/DBP                length (bit)   code (binary)
base                        17             {1'b1, base}
0 (run length 2-17)         6              {2'b01, (runlength-2)[3:0]}
0 (run length 1)            3              {3'b001}
all 1's                     5              {5'b00000}
DBX != 0 & DBP = 0          5              {5'b00001}
consecutive two 1's         9              {5'b00010, StartingOnePosition[3:0]}
single 1                    9              {5'b00011, OnePosition[3:0]}
uncompressed                16             {1'b1, DBX}
Each encoder encodes the data packets according to the BPC encoding rules shown in Table 1.
The BPC encoding rule table above describes a serial encoding process. For example, as shown in FIGS. 7 to 9, each packet contains 18 code words after BPC preprocessing (1 base + 17 DBX values), so BPC-encoding each preprocessed packet takes 18 clock cycles. On this basis, serial BPC encoding of package0~package6 would take 18 × 7 clock cycles; to meet the processing speed requirement, this embodiment can therefore use parallel multi-way encoders, achieving a higher processing speed.
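To make the serial character of one encoder concrete, the following sketch applies the Table 1 rules to a preprocessed packet (a hypothetical software rendering built on the bpc_preprocess sketch above; the exact bit numbering inside the 4-bit position fields is our assumption, as the table does not fix it):

ALL_ONES = (1 << 15) - 1                             # a 15-bit all-ones plane

def bpc_encode(base, dbp, dbx):
    # Serial BPC encoding of one preprocessed packet per Table 1;
    # emits a bit string: 17 bits of base, then one code per DBX symbol,
    # with runs of zero planes (length 2-17) merged into one code.
    bits = '1' + format(base & 0xFFFF, '016b')       # {1'b1, base}
    i = 0
    while i < len(dbx):
        if dbx[i] == 0:
            run = 1
            while i + run < len(dbx) and dbx[i + run] == 0 and run < 17:
                run += 1
            if run == 1:
                bits += '001'                        # zero run of length 1
            else:
                bits += '01' + format(run - 2, '04b')  # zero run of length 2-17
            i += run
            continue
        x = dbx[i]
        ones = [b for b in range(15) if (x >> b) & 1]
        if x == ALL_ONES:
            bits += '00000'                          # all 1's
        elif dbp[i] == 0:
            bits += '00001'                          # DBX != 0 and DBP == 0
        elif len(ones) == 2 and ones[1] == ones[0] + 1:
            bits += '00010' + format(ones[0], '04b')  # two consecutive 1's
        elif len(ones) == 1:
            bits += '00011' + format(ones[0], '04b')  # single 1
        else:
            bits += '1' + format(x, '015b')          # uncompressed {1'b1, DBX}
        i += 1
    return bits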
Next, step 54 may be executed: after BPC encoding finishes, the 7 BPC encoded packets output in parallel are merged back into serial data according to the original packetization logic, and the second compressed data (the merged serial data) and the second additional information are output.
In some possible embodiments, the second additional information in step 54 may include second row length information and packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the packet length information indicates the data length of each BPC encoded packet, which facilitates parallel processing during decompression.
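The dispatch-and-merge structure of steps 53 and 54 can be sketched as follows (the thread pool merely models the 16-way parallel encoders; ordered merging reproduces the original packet order, and the per-packet bit lengths are the packet length information described above):

from concurrent.futures import ThreadPoolExecutor

def bpc_compress(preprocessed):
    # Step 53: distribute preprocessed packets to parallel encoders;
    # step 54: merge the encoded packets in their original order and
    # record the packet length information needed for parallel decoding.
    with ThreadPoolExecutor(max_workers=16) as pool:
        encoded = list(pool.map(lambda p: bpc_encode(*p), preprocessed))
    packet_lengths = [len(s) for s in encoded]       # packet length information
    return ''.join(encoded), packet_lengths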
In some possible embodiments, step 22 may specifically include: in response to a first enable instruction, executing the first compression processing and the second compression processing in sequence, and outputting the second compressed data, the first additional information, and the second additional information as the compressed data. In some other embodiments, step 22 may further include: in response to a second enable instruction, executing the first compression processing alone and outputting the first compressed data and the first additional information as the compressed data; or, in response to a third enable instruction, executing the second compression processing alone and outputting the second compressed data and the second additional information as the compressed data.
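Combining the hypothetical sketches above, the three enable modes could look like this (simplified: the second row length bookkeeping is omitted, and the enable signal is modeled as a string argument):

def compress(row, enable):
    # Mode selection per the three enable instructions (sketch only).
    if enable == 'sc':                               # second enable instruction
        packed, bitmap, rlen = sparse_compress_row(row)
        return packed, {'bitmap': bitmap, 'row_len': rlen}
    if enable == 'bpc':                              # third enable instruction
        packets, _ = packetize(row)
        stream, plens = bpc_compress([bpc_preprocess(p) for p in packets])
        return stream, {'packet_lens': plens}
    # first enable instruction: SC followed by BPC
    packed, bitmap, rlen = sparse_compress_row(row)
    packets, _ = packetize(packed)
    stream, plens = bpc_compress([bpc_preprocess(p) for p in packets])
    return stream, {'bitmap': bitmap, 'row_len': rlen, 'packet_lens': plens}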
Table 2: Per-layer compression rates of the YoloV2-Relu network

Layer     BPC rate (%)   SC rate (%)   Total rate (%)
0         44.7817        48.2235       27.8453031
1         47.9434        52.1857       31.26959889
2         47.4407        69.4227       39.18461484
3         52.2488        79.7526       47.91977647
4         54.5907        58.2833       38.06726145
5         56.6316        65.2384       43.19554973
6         54.3179        38.5487       27.18884432
7         65.2585        56.8986       43.38117288
8         66.8773        41.2468       33.83474618
9         67.2865        53.8382       42.47584044
10        71.3162        18.7043       19.589196
11        77.5811        32.7226       31.63655303
12        80.6968        23.5358       25.24263745
13        74.2686        41.4905       37.06441348
14        78.0405        34.7637       33.3797653
15        80.1174        38.1705       36.83121217
16        86.8113        13.2722       17.77176936
17        91.4107        25.9872       30.00508143
18        96.6871        19.2184       24.83171363
19        96.192         26.2599       31.50992301
20        97.2372        18.854        24.57960115
21        99.5344        11.2598       17.45737437
22        99.1212        22.0608       28.11692969
23        96.9902        33.3507       38.59691063
24        96.1162        33.0696       38.03524288
25        86.8113        13.2722       17.77176936
26        71.3723        80.3276       63.58165565
27        76.3921        80.3276       67.61394052
28        88.664         42.5212       43.95099677
29        88.511         51.7377       52.03843361
30        79.3646        100           85.6146
31        78.997         100           85.247
Average   54.5658        49.131        33.0587232
Specifically, Table 2 takes the YoloV2-Relu network as an example, using averages obtained over 50 randomly selected input images, and lists the per-layer compression rates under three schemes: executing the first compression processing (SC encoding) alone, executing the second compression processing (BPC encoding) alone, and executing the first compression processing (SC encoding) and the second compression processing (BPC encoding) in sequence. As the table shows, when the two compression algorithms are used together, the average compression rate reaches 33%, meaning that data transfer can be reduced by roughly 70%. This not only shortens data transfer time but also saves bandwidth when interacting with the external memory, so that bandwidth resources can be allocated more sensibly to the other units of the neural network processing device, substantially improving device performance. At the same time, the parallelized design guarantees the compression speed and matches the compression rates of the upstream and downstream modules.
Based on the same or similar technical concept, an embodiment of the present invention further provides a data decompression method 100, as shown in FIG. 10, including:
Step 101: receiving compressed data, where the compressed data is generated using the data compression method shown in the foregoing embodiments;
Step 102: performing decompression processing on the compressed data using the inverse steps of the data compression method shown in the foregoing embodiments, so as to restore the compressed data.
It can be understood that decompression and compression are mutually inverse processes: the data decompression in this embodiment uses a process that is the exact inverse of each aspect of the data compression method embodiments above and obtains the corresponding technical effects, which are not repeated here.
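As one concrete instance of such an inverse step, the following sketch inverts the hypothetical bpc_preprocess function given earlier and recovers the original packet exactly (our own illustration; the patent does not specify the decompression hardware at this level):

def bpc_preprocess_inverse(base, dbx):
    # Undo the adjacent XOR by scanning from the last plane backwards.
    dbp = [0] * 17
    dbp[16] = dbx[16]
    for i in range(15, -1, -1):
        dbp[i] = dbx[i] ^ dbp[i + 1]

    # Undo the transpose: rebuild the 15 17-bit differences.
    deltas = []
    for col in range(15):
        d = 0
        for b in range(17):
            d = (d << 1) | ((dbp[b] >> (14 - col)) & 1)
        deltas.append(d)

    # Undo the subtraction: sign-extend 17-bit deltas and prefix-sum.
    packet = [base]
    for d in deltas:
        if d & 0x10000:                              # negative in 17-bit two's complement
            d -= 1 << 17
        packet.append((packet[-1] + d) & 0xFFFF)
    return packet

pkt = [7, 0, 65535, 42] * 4
b, dbp, dbx = bpc_preprocess(pkt)
assert bpc_preprocess_inverse(b, dbx) == pkt         # exact round trip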
Based on the same or similar technical concept, an embodiment of the present invention further provides a processing method based on data compression and decompression, including:
performing compression processing on data to be compressed using the data compression method shown in the foregoing embodiments to obtain compressed data, and transmitting the compressed data to and storing it in an external memory;
acquiring the compressed data stored in the external memory, performing decompression processing on the compressed data using the data decompression method shown in the foregoing embodiments to restore the compressed data to the sparse data, and inputting the restored sparse data into the neural network model to perform computation.
Based on the same or similar technical concept, an embodiment of the present invention further provides a data compression device 110. As shown in FIG. 11, the data compression device 110 includes:
a receiving module 111, configured to receive data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model;
a compression module 112, configured to perform compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.
In some possible embodiments, the data format of the data to be compressed is fixed-point data or floating-point data.
In some possible embodiments, the data to be compressed is high-order tensor data, and the compression module is further configured to perform the compression processing on the data to be compressed row by row.
In some possible embodiments, the compression module 112 further includes a first compression unit based on the sparse compression algorithm, configured to perform sparse compression on the data to be compressed in parallel and output first compressed data and first additional information, where the first compressed data is formed as tightly packed non-zero data.
In some possible embodiments, the first additional information includes a bitmap and first row length information, where the bitmap indicates the positions of the non-zero data in the data to be compressed, and the first row length information indicates the data length of each row of the data to be compressed after the first compression processing.
In some possible embodiments, the compression module 112 further includes a second compression unit based on the bit-plane compression algorithm, configured to: packetize the data to be compressed or the first compressed data to obtain a plurality of data packets; perform bit-plane compression preprocessing on each of the plurality of data packets; distribute the preprocessed data packets to multiple encoders to perform bit-plane compression in parallel, obtaining a plurality of BPC encoded packets; and merge the plurality of BPC encoded packets to output second compressed data and second additional information.
In some possible embodiments, the second additional information includes second row length information and packet length information, where the second row length information indicates the data length of each row of the data to be compressed after the second compression processing, and the packet length information indicates the data length of each BPC encoded packet.
In some possible embodiments, the second compression unit is further configured to perform the bit-plane compression preprocessing on the plurality of data packets using a multiplexing mechanism.
In some possible embodiments, the second compression unit is further configured to packetize the data to be compressed or the first compressed data according to a preset length, where, if the last packet of the plurality of data packets falls short of the preset length, the last packet is left uncompressed or is zero-padded.
In some possible embodiments, the compression module 112 is further configured to: in response to a first enable instruction, execute the first compression processing and the second compression processing in sequence and output the second compressed data, the first additional information, and the second additional information as the compressed data; in response to a second enable instruction, execute the first compression processing alone and output the first compressed data and the first additional information as the compressed data; and in response to a third enable instruction, execute the second compression processing alone and output the second compressed data and the second additional information as the compressed data.
Based on the same or similar technical concept, an embodiment of the present invention further provides a data decompression device 120. As shown in FIG. 12, the data decompression device 120 includes:
an acquisition module 121, configured to acquire compressed data generated using the data compression method shown in the foregoing embodiments;
a decompression module 122, configured to perform decompression processing on the compressed data using the inverse steps of the data compression method shown in the foregoing embodiments, so as to restore the compressed data.
Based on the same or similar technical concept, an embodiment of the present invention further provides a processing device 130 based on data compression and decompression. As shown in FIG. 13, it includes:
a data compression device 131, configured to acquire, from an operation unit 12, sparse data output by any layer of the neural network model as the data to be compressed, perform compression processing on the data to be compressed using the data compression method shown in the foregoing embodiments to obtain compressed data, and transmit the compressed data to and store it in an external memory 14;
a data decompression device 132, configured to acquire the compressed data stored in the external memory 14, perform decompression processing on the compressed data using the data decompression method shown in the foregoing embodiments so as to restore the compressed data to sparse data, and input the restored sparse data into the neural network model in the operation unit 12 to perform computation.
With the above processing device, the strong compression achieved by the data compression significantly saves data transfer bandwidth and external memory space and improves memory access efficiency. It also reduces chip cost and power consumption, thereby improving the computing capability of the processing device.
FIG. 14 is a schematic diagram of a data compression or decompression device according to an embodiment of the present application, configured to execute the data compression method shown in FIG. 2 or the data decompression method shown in FIG. 10. The device includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: receive data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model, and perform compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data; or to enable the at least one processor to: receive compressed data generated using the data compression method shown in the foregoing embodiments, and perform decompression processing on the compressed data using the inverse steps of that method so as to restore the compressed data.
An embodiment of the present application further provides a computer-readable storage medium storing a program which, when executed by a multi-core processor, causes the multi-core processor to: receive data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model, and perform compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data; or causes the multi-core processor to: receive compressed data generated using the data compression method shown in the foregoing embodiments, and perform decompression processing on the compressed data using the inverse steps of that method so as to restore the compressed data.
The embodiments in this application are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device, apparatus, and computer-readable storage medium embodiments are substantially similar to the method embodiments, their description is simplified; for the relevant details, refer to the description of the method embodiments.
The devices and computer-readable storage media provided in the embodiments of this application correspond one-to-one with the methods; they therefore have beneficial technical effects similar to those of their corresponding methods, which have been described in detail above and are not repeated here.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. In addition, although the operations of the method of the present invention are described in a particular order in the drawings, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the disclosed specific embodiments, and the division into aspects does not mean that features in these aspects cannot be combined to advantage; this division is merely for convenience of expression. The present invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (24)

  1. A data compression method, characterized by comprising:
    receiving data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model; and
    performing compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.
  2. The method according to claim 1, characterized in that the data format of the data to be compressed is fixed-point data or floating-point data.
  3. The method according to claim 1, characterized in that the data to be compressed is high-order tensor data, and the compression processing performed includes: performing the compression processing on the data to be compressed row by row.
  4. The method according to claim 1, characterized in that the compression processing performed includes a first compression processing based on the sparse compression algorithm, comprising:
    performing sparse compression on the data to be compressed in parallel, and outputting first compressed data and first additional information, wherein the first compressed data is formed as tightly packed non-zero data.
  5. The method according to claim 4, characterized in that the first additional information includes a bitmap and first row length information, wherein the bitmap is used to indicate the positions of the non-zero data in the data to be compressed, and the first row length information is used to indicate the data length of each row of the data to be compressed after the first compression processing.
  6. The method according to claim 4, characterized in that the compression processing performed includes a second compression processing based on the bit-plane compression algorithm, comprising:
    packetizing the data to be compressed or the first compressed data to obtain a plurality of data packets;
    performing bit-plane compression preprocessing on each of the plurality of data packets;
    distributing the preprocessed data packets to multiple encoders to perform bit-plane compression in parallel, obtaining a plurality of BPC encoded packets; and
    merging the plurality of BPC encoded packets to output second compressed data and second additional information.
  7. The method according to claim 6, characterized in that the second additional information includes:
    second row length information and packet length information, wherein the second row length information is used to indicate the data length of each row of the data to be compressed after the second compression processing, and the packet length information is used to indicate the data length of each of the BPC encoded packets.
  8. The method according to claim 6, characterized by further comprising: performing the bit-plane compression preprocessing on the plurality of data packets using a multiplexing mechanism.
  9. The method according to claim 6, characterized in that packetizing the data to be compressed or the first compressed data further comprises:
    packetizing the data to be compressed or the first compressed data according to a preset length;
    wherein, if the last packet of the plurality of data packets falls short of the preset length, the last packet of the plurality of data packets is not compressed, or the last packet of the plurality of data packets is zero-padded.
  10. The method according to any one of claims 3 to 9, characterized in that performing compression processing on the data to be compressed based on the sparse compression algorithm and the bit-plane compression algorithm further comprises:
    in response to a first enable instruction, executing the first compression processing and the second compression processing in sequence, and outputting the second compressed data, the first additional information, and the second additional information as the compressed data;
    in response to a second enable instruction, executing the first compression processing alone, and outputting the first compressed data and the first additional information as the compressed data; and
    in response to a third enable instruction, executing the second compression processing alone, and outputting the second compressed data and the second additional information as the compressed data.
  11. A data decompression method, characterized by comprising:
    receiving compressed data generated using the data compression method according to any one of claims 1 to 10; and
    performing decompression processing on the compressed data using the inverse steps of the data compression method according to any one of claims 1 to 10, so as to restore the compressed data.
  12. A processing method based on data compression and decompression, characterized by comprising:
    performing compression processing on data to be compressed using the method according to any one of claims 1 to 10 to obtain compressed data, and transmitting the compressed data to and storing it in an external memory; and
    acquiring the compressed data stored in the external memory, performing decompression processing on the compressed data using the method according to claim 11, restoring the compressed data to the sparse data, and inputting the restored sparse data into the neural network model to perform computation.
  13. A data compression device, characterized by comprising:
    a receiving module, configured to receive data to be compressed, the data to be compressed being sparse data output by any layer of a neural network model; and
    a compression module, configured to perform compression processing on the data to be compressed based on a sparse compression algorithm and a bit-plane compression algorithm to obtain compressed data.
  14. The device according to claim 13, characterized in that the data format of the data to be compressed is fixed-point data or floating-point data.
  15. The device according to claim 13, characterized in that the data to be compressed is high-order tensor data, and the compression module is further configured to perform the compression processing on the data to be compressed row by row.
  16. The device according to claim 13, characterized in that the compression module further includes a first compression unit based on the sparse compression algorithm, configured to perform sparse compression on the data to be compressed in parallel and output first compressed data and first additional information, wherein the first compressed data is formed as tightly packed non-zero data.
  17. The device according to claim 16, characterized in that the first additional information includes a bitmap and first row length information, wherein the bitmap is used to indicate the positions of the non-zero data in the data to be compressed, and the first row length information is used to indicate the data length of each row of the data to be compressed after the first compression processing.
  18. The device according to claim 16, characterized in that the compression module further includes a second compression unit based on the bit-plane compression algorithm, configured to:
    packetize the data to be compressed or the first compressed data to obtain a plurality of data packets;
    perform bit-plane compression preprocessing on each of the plurality of data packets;
    distribute the preprocessed data packets to multiple encoders to perform bit-plane compression in parallel, obtaining a plurality of BPC encoded packets; and
    merge the plurality of BPC encoded packets to output second compressed data and second additional information.
  19. The device according to claim 18, characterized in that the second additional information includes:
    second row length information and packet length information, wherein the second row length information is used to indicate the data length of each row of the data to be compressed after the second compression processing, and the packet length information is used to indicate the data length of each of the BPC encoded packets.
  20. The device according to claim 18, characterized in that the second compression unit is further configured to perform the bit-plane compression preprocessing on the plurality of data packets using a multiplexing mechanism.
  21. The device according to claim 18, characterized in that the second compression unit is further configured to:
    packetize the data to be compressed or the first compressed data according to a preset length;
    wherein, if the last packet of the plurality of data packets falls short of the preset length, the last packet of the plurality of data packets is not compressed, or the last packet of the plurality of data packets is zero-padded.
  22. The device according to any one of claims 16 to 21, characterized in that the compression module is further configured to:
    in response to a first enable instruction, execute the first compression processing and the second compression processing in sequence, and output the second compressed data, the first additional information, and the second additional information as the compressed data;
    in response to a second enable instruction, execute the first compression processing alone, and output the first compressed data and the first additional information as the compressed data; and
    in response to a third enable instruction, execute the second compression processing alone, and output the second compressed data and the second additional information as the compressed data.
  23. A data decompression device, characterized by comprising:
    an acquisition module, configured to acquire compressed data generated using the data compression method according to any one of claims 1 to 10; and
    a decompression module, configured to perform decompression processing on the compressed data using the inverse steps of the data compression method according to any one of claims 1 to 10, so as to restore the compressed data.
  24. A processing device based on data compression and decompression, characterized by comprising:
    a data compression device, configured to perform compression processing on the data to be compressed using the method according to any one of claims 1 to 10 to obtain compressed data, and to transmit the compressed data to and store it in an external memory; and
    a data decompression device, configured to acquire the compressed data stored in the external memory and perform decompression processing on the compressed data using the method according to claim 11, so as to restore the compressed data to the sparse data, and to input the restored sparse data into the neural network model to perform computation.
PCT/CN2020/118674 2019-12-03 2020-09-29 Data compression and decompression, and processing method and apparatus based on data compression and decompression WO2021109696A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911217983.3 2019-12-03
CN201911217983.3A CN110943744B (zh) 2019-12-03 Data compression and decompression, and processing method and apparatus based on data compression and decompression

Publications (1)

Publication Number Publication Date
WO2021109696A1 (zh)

Family ID: 69908661

Country Status (2)

Country Link
CN (1) CN110943744B (zh)
WO (1) WO2021109696A1 (zh)


Also Published As

Publication number Publication date
CN110943744A (zh) 2020-03-31
CN110943744B (zh) 2022-12-02

