CN114697673B - Neural network quantization compression method and system based on inter-stream data shuffling - Google Patents

Neural network quantization compression method and system based on inter-stream data shuffling

Info

Publication number
CN114697673B
CN114697673B (application CN202011607729.7A)
Authority
CN
China
Prior art keywords
data
character
compression
coding
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011607729.7A
Other languages
Chinese (zh)
Other versions
CN114697673A (en)
Inventor
He Haoyuan (何皓源)
Wang Bingrui (王秉睿)
Zhi Tian (支天)
Guo Qi (郭崎)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202011607729.7A priority Critical patent/CN114697673B/en
Publication of CN114697673A publication Critical patent/CN114697673A/en
Application granted granted Critical
Publication of CN114697673B publication Critical patent/CN114697673B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a neural network quantization compression method and system based on inter-stream data shuffling, comprising: obtaining quantized neural network data to be compressed and partitioning the data into a plurality of data blocks; allocating a data stream to each data block for compression, the data stream selecting an input buffer either randomly or according to a preset rule to obtain the compression result of the data block; and collecting the compression results of all data blocks as the compression result of the neural network data. The invention avoids continuously reading data from the same input buffer, increases the input randomness of a single data stream, and balances the coding rate of each data stream, thereby improving the utilization efficiency of hardware resources.

Description

Neural network quantization compression method and system based on inter-stream data shuffling
Technical Field
The invention relates to the field of neural network operation, in particular to a neural network quantization compression method and system based on inter-stream data shuffling.
Background
In recent years, artificial intelligence has developed rapidly against the background of an explosion in both data volume and hardware computing power, and has become a major driving force for productivity development and technical innovation. As a main branch of artificial intelligence, neural network algorithms have run into a technical bottleneck: to further improve model accuracy, their structures have become complex and their parameter and computation amounts huge, which limits the application of neural network models in scenarios that pursue throughput and energy efficiency, so computational efficiency has become the main research target of the next stage. The most effective neural network compression methods at present combine low precision and sparsification, which can reduce the number of parameters of the neural network to a certain extent, but cannot further exploit the data redundancy remaining in the pruned and quantized neural network model.
Pruning (Pruning), as shown in fig. 1, may also be referred to as model sparsification: a part of the less important parameters in the neural network is directly clipped to 0, and the computation related to those parameters is masked as well. According to the basic unit being clipped, pruning can be divided into single-parameter pruning and structured-parameter pruning.
Quantization (Quantization), as shown in fig. 2, may also be referred to as low-bit representation of the model: a low-precision representation reduces the number of bits required to store part or all of the data in a neural network model, thereby reducing the parameter volume and simplifying most floating-point operations at the same time, so as to achieve model compression.
The rationality of network quantization comes from the data distribution characteristics of the neural network itself. For example, the output neurons passing through an activation layer may contain many 0s, and the absolute values of most weights fall within the [0,1] interval, so large amounts of data overflow are avoided; combined with a retraining process, the accuracy of the model can easily be recovered after low-precision quantization. On the other hand, to guarantee training convergence speed and accuracy, deep learning development frameworks such as Caffe, TensorFlow and PyTorch mainly use 32-bit or 64-bit floating-point numbers, which is generally wasteful in ASIC or FPGA hardware implementations: it not only causes more storage and computation overhead, but also significantly increases the area and power consumption of the arithmetic units in the chip.
For most network model inference operations, low-precision data of around 8 bits can already meet the requirements on network performance, so industrial application of neural network quantization is very mature; Nvidia's TensorRT library and many deep learning acceleration architectures at home and abroad support and optimize 8-bit fixed-point neural network computation and storage.
Data reduction (Data Reduction), as shown in fig. 3, is a method of replacing the raw data with a small portion of data to reduce the number of parameters in the neural network model. Data reduction is originally a concept in data preprocessing; it is not limited to downsizing the data set of a neural network, but has gradually been used to explore compression methods for neural network parameters and even the network structure. According to the form of the compressed data, data reduction can be further divided into two categories: dimensionality reduction and numerosity reduction.
The objective of dimensionality reduction is to reduce the number of independent variables of the learnable parameters, for example by applying wavelet transform, principal component analysis or discrete cosine transform to the parameters in the model and then performing quantization or entropy coding using the transform-domain characteristics of the data; combined with pruning, quantization and retraining, this can reduce the redundancy of the neural network model from the transform-domain perspective. The idea is similar to the treatment of DC components in JPEG image compression in that it exploits data locality in the neural network, but considering the essential difference between feature maps and natural images, the compression strategy for transform-domain parameters needs to be explored carefully, so as to avoid an unrecoverable loss of model accuracy after the inverse transform, which would invalidate the whole compression method.
Numerosity reduction focuses on methods that reduce the number of parameters more directly, such as linear regression, clustering and sampling, to achieve parameter sharing within a certain range, where the granularity of sharing may be a parameter matrix, channels, convolution kernels, layers or even sub-networks.
Surveying the existing research results, it can be found that pruning, quantization and data reduction often appear together in comprehensive neural network compression methods, which to a certain extent also points out a direction for hardware optimization. This is because, when forming a parameter-sharing strategy, the parameters in the model must be grouped according to some evaluation index, so numerosity reduction has a natural coupling with the pruning and quantization methods, and the granularity of parameter sharing can very naturally serve as the granularity of pruning and quantization.
On the other hand, traditional data compression methods (entropy coding, transform coding, run-length coding, hybrid coding and the like) are widely applied in neural network compression to perform a final compression pass on models that have already undergone pruning, quantization and data reduction. They can save 20%-30% of storage overhead and improve the compression ratio, and when deployed on a suitable hardware acceleration architecture they can also significantly improve the computational performance of inference and training.
In addition, neural network compression research has also extended the meaning of lossless compression: whether pruning, quantization or data reduction is used, the parameters in the neural network are distorted, and although the accuracy of the compressed model can be recovered through retraining, the model data undergo an irreversible change, which further reduces the interpretability of the neural network algorithm and differs from the definition of lossless compression in classical data compression. It is therefore necessary to state here that the lossless compression referred to herein does not mean a neural network compression method in which the model accuracy is lossless, but a data compression method in which the data itself can be restored without any distortion.
Data distribution after pruning and quantization. The data distribution in a neural network conforms to the law of large numbers, so the parameters of a pre-trained model approximately follow a normal distribution with mean 0. Because of the decrease in numerical resolution, low-precision quantization turns this into a discrete normal distribution, and values beyond the representable range overflow, but the envelope of the distribution remains unchanged. Taking the last convolutional layer Conv5 in AlexNet as an example, fig. 4 shows the weight distribution of this layer after quantization to 8-bit fixed-point numbers. At the same time, pruning causes a portion of the parameters whose absolute values are near 0 to be clipped to 0; as shown in fig. 5, a hole appears in the distribution near 0, and the clipped data are concentrated in a peak at exactly 0.
Under the dual constraint of pruning and quantization, retraining can restore model accuracy, but it cannot recover the normal distribution of the pre-trained model; instead, a bimodal distribution appears, as shown in fig. 6. From this phenomenon we can see that a compression method combining pruning, quantization and retraining introduces a large number of 0s and deeply changes the data distribution in the neural network.
Huffman coding can reach the entropy limit on character sequences whose frequency distribution conforms to 2^-n, while run-length coding does not provide second-order coding. Combining these facts with the change of the data distribution, the following two points to be optimized can be identified:
(1) Compared with the original normal distribution, the occurrence frequencies of the individual values after retraining are more even, which may increase the average code length of Huffman coding, so the data must be pre-coded before entropy coding to concentrate the frequency distribution of the characters fed into the Huffman coder.
(2) Although retraining recovers a portion of the clipped parameters, there are still a large number of 0s in the weights and they do not necessarily occur consecutively, so a coding method that can fully compress these 0s is needed to increase the compression ratio of the pruned model.
A neural network compression method is not only an improvement of the neural network algorithm, but also a design and application of a data compression method, so its effect must be evaluated from two aspects, accuracy and compression ratio; moreover, to measure the influence on hardware bandwidth more reasonably, the bit rate should also be introduced as an evaluation index.
Accuracy. The neural network compression methods described above change the neural network either numerically or structurally, which necessarily affects the performance of the model on its target task, and whether the pre-training accuracy can be recovered depends on whether the damage caused by the compression algorithm to the model is reversible. Therefore, the recognition accuracy of the neural network after compression and a series of retraining operations, especially in comparison with the pre-trained model, is the primary index for evaluating a compression method. For image classification this is expressed as the Top-1 and Top-5 accuracy, while tasks such as object detection and scene segmentation also need to consider metrics such as mAP.
The compression ratio, i.e. the ratio between the sizes of the original data and the compressed data, is an intuitive measure of a data compression method and the common evaluation standard of all compression algorithms.
The bit rate (BitRate), i.e. the average number of bits needed to represent one character in the compressed data, has a meaning similar to the weighted average code length in Huffman coding; a low bit rate indicates that the current compression method matches the original data distribution better, and less hardware bandwidth is needed to handle the compressed data.
Based on the above three evaluation indexes, the invention provides a brand-new lossless compression algorithm, aiming to design, with reference to the existing technical foundation and evaluation system, a coding method better suited to neural network models, so as to substantially optimize neural network compression.
Disclosure of Invention
The invention provides a parallel multi-stream coding and decoding method based on normalized HSF (Huffman-Shannon-Fano) coding, run-length coding and a further all-zero substitution coding, a Shuffle method that improves the compression ratio by shuffling data among streams, and a FakeLiteral insertion method that avoids deadlock by balancing the compression speed among streams, and finally provides an efficient neural network lossless compression coding and a hardware pipeline implementation scheme thereof.
Specifically, the invention provides a neural network quantization compression method based on inter-stream data shuffling, which comprises the following steps:
Step 1, acquiring the neural network data to be compressed after quantization processing, and partitioning the neural network data to obtain a plurality of data blocks;
step 2, allocating a data stream to each data block for compression, wherein the compression in each data stream comprises: performing run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding performs run-length coding only on the zero characters in the neural network data; performing normalized Huffman coding on the run-length compressed data; and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
step 3, collecting compression results of all data blocks to be used as compression results of the neural network data;
wherein, this step 2 includes: setting input caches with the same number as the data streams, wherein each data stream is provided with an output cache for caching the compression result of the data block; the data stream randomly selects an input buffer or selects an input buffer according to a preset rule and obtains the data block therefrom.
According to the neural network quantization compression method based on the inter-stream data shuffling, each data stream is provided with an independent input buffer memory and an output buffer memory, the input buffer memory is used for buffering the data block, and the output buffer memory is used for buffering the compression result of the data block.
The neural network quantization compression method based on the inter-stream data shuffling comprises the following steps:
step 21, monitoring the amount of data each data stream has compression-encoded, and judging whether P_current − P_min ≥ S is satisfied; if yes, writing the virtual codes corresponding to virtual characters into the output buffer of the current data stream, otherwise executing step 21 again; wherein P_current is the number of input queue widths already encoded by the current data stream, P_min is the number of input queue widths already encoded by the data stream with the slowest encoding rate, and S is the pipeline depth of the data stream.
The neural network quantization compression method based on inter-stream data shuffling, wherein the amount of virtual codes written into the output buffer of the current data stream is P_current − P_min.
The method for compressing neural network quantization based on data shuffling between streams, wherein the step 21 comprises: writing virtual characters into an output buffer of the current data stream to obtain third intermediate data; judging whether the character identical to the virtual character in the third intermediate data is the original character in the output buffer of the current data stream, if so, replacing the character identical to the virtual character in the third intermediate data with the virtual code, and adding a flag bit representing that the character is the original character at the back, otherwise, replacing the character identical to the virtual character in the third intermediate data with the virtual code, and adding a flag bit representing that the character is the replacement character at the back.
The invention also provides a neural network quantization compression system based on the data shuffling among streams, which comprises:
the module 1 is used for acquiring the neural network data to be compressed after quantization processing, and partitioning the neural network data to obtain a plurality of data blocks;
a module 2, configured to allocate a data stream to each data block for compression, where the compression in each data stream includes: performing run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding performs run-length coding only on the zero characters in the neural network data; performing normalized Huffman coding on the run-length compressed data; and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
a module 3, configured to aggregate compression results of the data blocks as compression results of the neural network data;
wherein the module 2 comprises: setting input caches with the same number as the data streams, wherein each data stream is provided with an output cache for caching the compression result of the data block; the data stream randomly selects an input buffer or selects an input buffer according to a preset rule and obtains the data block therefrom.
The neural network quantization compression system based on the data shuffling among streams is characterized in that each data stream is provided with an independent input buffer memory and an output buffer memory, the input buffer memory is used for buffering the data block, and the output buffer memory is used for buffering the compression result of the data block.
The neural network quantization compression system based on the data shuffling between streams, wherein the module 2 comprises:
a module 21, for monitoring the amount of data each data stream has compression-encoded and judging whether P_current − P_min ≥ S is satisfied; if yes, writing the virtual codes corresponding to virtual characters into the output buffer of the current data stream, otherwise executing module 21 again; wherein P_current is the number of input queue widths already encoded by the current data stream, P_min is the number of input queue widths already encoded by the data stream with the slowest encoding rate, and S is the pipeline depth of the data stream.
The neural network quantization compression system based on inter-stream data shuffling, wherein the amount of virtual codes written into the output buffer of the current data stream is P_current − P_min.
The neural network quantization compression system based on the data shuffling between streams, wherein the module 21 comprises: writing virtual characters into an output buffer of the current data stream to obtain third intermediate data; judging whether the character identical to the virtual character in the third intermediate data is the original character in the output buffer of the current data stream, if so, replacing the character identical to the virtual character in the third intermediate data with the virtual code, and adding a flag bit representing that the character is the original character at the back, otherwise, replacing the character identical to the virtual character in the third intermediate data with the virtual code, and adding a flag bit representing that the character is the replacement character at the back.
The advantages of the invention are as follows:
1. Aiming at the sparsity of quantized neural network data, the invention improves run-length coding and proposes run-length all-zero coding, so that neural network data can be compressed losslessly and more efficiently;
2. The run-length all-zero coding includes second-order character substitution, which further improves compression efficiency and reduces the number of 0s appearing in the data, leaving more compression space for the subsequent Huffman coding;
3. The Huffman tree is reformed from top to bottom, so that storing the complete Huffman tree structure is no longer necessary and the complexity of the table-lookup operation is significantly reduced;
4. By writing the virtual codes corresponding to virtual characters into the output buffer of a low-speed data stream, the compression speed among streams is balanced and the coding gap among pipelines is reduced, thereby avoiding deadlock.
5. Each data stream selects an input buffer either randomly or according to a preset rule, which avoids continuously reading data from the same input buffer, increases the input randomness of a single data stream, further balances the coding rate of each data stream, and improves the utilization efficiency of hardware resources.
Drawings
FIG. 1 is a diagram of network pruning and parameter recovery;
FIG. 2 is a graph of progressive quantization of network parameters;
FIG. 3 is a parameter sharing and clustering center fine-tuning diagram;
FIG. 4 is a graph of quantized Conv5 layer weights;
FIG. 5 is a graph of Conv5 layer weights after quantization and pruning;
FIG. 6 is a distribution diagram of Conv5 layer weights after retraining;
FIG. 7 is a first stage encoding result diagram;
FIG. 8 is a second stage encoding rule diagram;
FIG. 9 is a second stage encoding result diagram;
FIGS. 10a and 10b are diagrams of normalized Huffman tree examples;
FIG. 11 is a graph of data split in parallel compression;
FIG. 12 is a diagram of a multi-data stream codec;
FIG. 13 is a schematic diagram of the Last bit;
FIG. 14 is a diagram of a Head-Data storage format;
FIG. 15 is a schematic diagram of the generation of a deadlock;
FIG. 16 is a FakeLiteral substitution rule diagram;
fig. 17 is a flow chart of the present invention.
Detailed Description
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
Aiming at optimizing neural network compression, this work analyzes the distribution characteristics of the data in a neural network model after pruning and quantization, proposes a lossless compression algorithm combining entropy coding, run-length coding and all-zero coding, and fully explores its deployment form on hardware. Finally, an NNCODEC neural network coding and decoding simulator is designed and implemented, and longitudinal and transverse comparison experiments are performed on 7 currently mature neural network models, which prove the optimization effect of the hybrid coding on neural network compression at the software level and also provide a hardware design scheme that is easy to realize.
Run-length encoding replaces a number of repeated characters with the combination of a character plus a run; the number of consecutive occurrences of the same character is called the run length. For example, the character sequence {AABBCCCD} is compressed into {A2B2C3D1} by run-length encoding.
Run-length encoding is an effective data compression method; a reasonable run bit width needs to be chosen in a concrete implementation, so that a too-short run does not cause repeated compression and a too-long run does not reduce the compression ratio. Furthermore, it can be observed in principle that this method cannot be applied again to data that has already been run-length encoded.
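As a point of reference for the improved coding below, the following Python sketch (illustrative only, not part of the claimed implementation) shows classical run-length encoding applied to the {AABBCCCD} example.
```python
def run_length_encode(s):
    """Classical run-length encoding: each run of identical characters
    is replaced by the character followed by its run length."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the run of s[i]
        out.append(f"{s[i]}{j - i}")    # character + run length
        i = j
    return "".join(out)

assert run_length_encode("AABBCCCD") == "A2B2C3D1"
```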
A reasonable run bit width should first be selected for the run-length all-zero encoding. Considering that current neural network compression models mostly quantize with fixed-point numbers of 8-bit or lower bit width, the invention adopts a short 2-bit run, i.e. at most 4 consecutive 0s are compressed into the form of the digit 0 plus the run 3. Taking the sequence {1,0,0,0,0,2,0,255,0,0} in 8-bit fixed-point format as the data to be encoded, the encoding result of the first stage is shown in fig. 7.
The total bit width of the data to be encoded is 80 bits, and the total code length after first-stage compression is 64 bits, whereas traditional run-length encoding would give a total code length of 72 bits. Considering that the compressed neural network model contains a large number of consecutive 0s, attaching no run to non-zero data can be regarded as an improvement of run-length encoding for neural network computation.
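For illustration, the first stage can be sketched in Python as follows. The sketch assumes that each group of up to four consecutive 0s is emitted as the literal 0 followed by a 2-bit run field (run = count − 1) and that non-zero characters carry no run field; the exact field layout of fig. 7 is not reproduced here, so the bit totals of this toy model need not match the figure.
```python
def rle_all_zero_stage1(data, run_bits=2):
    """First stage of run-length all-zero coding: only zeros carry a run field.
    Each group of up to 2**run_bits consecutive zeros becomes ("zero", run)
    with run = group_size - 1; non-zero values pass through unchanged.
    Returns the token list and a bit count (8 bits per value, run_bits per run)."""
    max_group = 1 << run_bits
    tokens, bits, i = [], 0, 0
    while i < len(data):
        if data[i] == 0:
            j = i
            while j < len(data) and data[j] == 0 and j - i < max_group:
                j += 1
            tokens.append(("zero", j - i - 1))   # literal 0 + 2-bit run
            bits += 8 + run_bits
            i = j
        else:
            tokens.append(("literal", data[i]))  # plain 8-bit value
            bits += 8
            i += 1
    return tokens, bits

tokens, bits = rle_all_zero_stage1([1, 0, 0, 0, 0, 2, 0, 255, 0, 0])
print(tokens, bits)   # the four consecutive zeros collapse to one ("zero", 3) token
```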
Special character substitution and the Extra bit. To further improve the compression ratio of the all-zero data without introducing excessive overhead, in the second stage we select a character with an extremely low probability of occurrence in the original data (e.g. 255) as ZeroLiteral and use it to replace the coding segments with run 3 in the existing data. To distinguish it from characters that are numerically equal to ZeroLiteral, those characters must in turn be replaced with another character of extremely low occurrence probability (e.g. 254), called ZeroExtra, followed by a 1-bit Extra bit to distinguish them from characters numerically equal to ZeroExtra, as shown in fig. 8.
It should be noted that the option of directly appending 1 bit after ZeroLiteral is not chosen here; the second-order character substitution described above is adopted instead. Because characters numerically equal to ZeroLiteral and ZeroExtra occur very rarely, performing another character substitution and then distinguishing with the Extra bit is more economical than appending 1 bit to every code produced for 4 consecutive 0s. The result of the second-stage encoding is shown in fig. 9.
This completes one pass of run-length all-zero encoding; the decoding flow in the general case can then be summarized as follows:
(1) Special character restoration: if ZeroLiteral is found in the compressed data, it is replaced with the digit 0 and the run 3; if ZeroExtra is found in the compressed data, the Extra bit following it is checked, and the character is restored to the value equal to ZeroLiteral or to ZeroExtra according to that bit.
(2) Run-length decoding: run-length decoding is performed on the 0s in the data restored by step (1); the following 2-bit run is expanded and the corresponding number of 0s is written into the result.
This concludes the algorithmic description of the run-length all-zero coding; in the example, the 80 bits of data to be encoded are finally compressed into 49 bits. Because of the all-zero coding and character substitution strategies, the method is clearly superior to plain run-length coding on a pruned and quantized model. In addition, since 4 consecutive 0s are directly replaced by ZeroLiteral at a 4x compression ratio, the number of 0s in the data is reduced, leaving more compression space for the subsequent Huffman coding.
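The second-stage substitution and the decoding rules (1) and (2) can be illustrated with the following Python sketch. The values 255 for ZeroLiteral and 254 for ZeroExtra follow the examples in the text; the token representation and the polarity of the Extra bit are assumptions made for illustration, since fig. 8 is not reproduced here.
```python
ZERO_LITERAL = 255   # replaces a (0, run=3) coding segment
ZERO_EXTRA   = 254   # second-order substitute for real 255/254 characters

# Stage-1 tokens for the example {1,0,0,0,0,2,0,255,0,0}:
# non-zero values are literals, each zero group carries a 2-bit run (count - 1).
stage1 = [("literal", 1), ("zero", 3), ("literal", 2),
          ("zero", 0), ("literal", 255), ("zero", 1)]

def stage2_substitute(tokens):
    """Second stage: a (0, run=3) segment becomes ZERO_LITERAL; real characters
    equal to ZERO_LITERAL or ZERO_EXTRA become ZERO_EXTRA plus a 1-bit Extra
    flag (assumed polarity: 0 -> was 255, 1 -> was 254)."""
    out = []
    for kind, val in tokens:
        if kind == "zero" and val == 3:
            out.append(("char", ZERO_LITERAL))
        elif kind == "literal" and val in (ZERO_LITERAL, ZERO_EXTRA):
            out.append(("char", ZERO_EXTRA))
            out.append(("extra", 0 if val == ZERO_LITERAL else 1))
        else:
            out.append((kind, val))
    return out

def stage12_decode(stream):
    """Inverse of both stages for the token representation above."""
    data, i = [], 0
    while i < len(stream):
        kind, val = stream[i]
        if kind == "char" and val == ZERO_LITERAL:
            data.extend([0, 0, 0, 0])            # restore digit 0 with run 3
        elif kind == "char" and val == ZERO_EXTRA:
            _, extra = stream[i + 1]             # check the Extra bit
            data.append(ZERO_LITERAL if extra == 0 else ZERO_EXTRA)
            i += 1
        elif kind == "zero":
            data.extend([0] * (val + 1))         # expand remaining zero runs
        else:
            data.append(val)
        i += 1
    return data

encoded = stage2_substitute(stage1)
assert stage12_decode(encoded) == [1, 0, 0, 0, 0, 2, 0, 255, 0, 0]
```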
Normalized Huffman coding. The Huffman code corresponding to a given data distribution is not unique, because nodes with the same frequency may appear during Huffman tree construction. To apply Huffman coding better, a fixed and efficient Huffman tree construction method is needed; the unique Huffman code generated by this method is called the normalized Huffman code.
Huffman tree reforming. Specifically, the invention adopts HSF coding as the normalized Huffman code. The main idea of this coding is to reform the Huffman tree from top to bottom: among nodes of the same level, leaf nodes are preferentially moved to the left side of the binary tree, so that, without changing the frequency distribution, all codewords can be obtained by "add 1 within the same code length, add 1 then shift left when moving to a longer code length". Complex binary tree traversal or full table lookup is thereby replaced by comparison and addition operations, greatly reducing the storage and computation cost required for encoding and decoding. As shown in fig. 10a, fig. 10b and Table 1, given the character sequence u1, u2, u3, u4, u5 and any two Huffman trees for it, their normalized form can be found using the reforming method described above.
Table 1 normalized Huffman coding example
Character    Code length    Normalized HSF code
u1           2              00
u4           2              01
u5           2              10
u2           3              110
u3           3              111
Code table redefinition. Each Huffman tree corresponds to one normalized form, and the normalized codewords obey a hardware-friendly rule: within the same code length the next codeword is the previous one plus 1, and when moving to a longer code length it is the previous one plus 1 then shifted left. Using this rule, the code table can be redefined, which completely removes the need to store the full Huffman tree structure and significantly reduces the complexity of the table-lookup operation.
Taking the sequence to be encoded in case (a) above as an example, the HSF encoding and decoding process needs the following code tables:
(1) CharTable: all characters to be encoded, arranged in descending order of occurrence frequency, i.e. {u1, u4, u5, u2, u3} in case (a); shared by encoding and decoding.
(2) LenTable: the HSF code lengths in example (a) are only 2 bits and 3 bits, so the corresponding LenTable is {2, 3}; shared by encoding and decoding.
(3) RangeTable: for each effective code length, the index in CharTable of the character carrying the last codeword of that length; the last 2-bit and 3-bit codewords in example (a) correspond to u5 and u3 respectively, so RangeTable is {2, 4}; used only in the encoding stage.
(4) LimitTable: the numerical value of the last codeword of each effective code length; e.g. in (a) the last 2-bit and 3-bit codewords are 10 and 111 respectively, so LimitTable is {2, 7}; used only in the decoding stage.
(5) BaseTable: LimitTable minus RangeTable; BaseTable in example (a) is {0, 3}; shared by encoding and decoding (an illustrative construction sketch follows this list).
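For illustration, the following Python sketch shows one way the five tables can be derived from the characters sorted by frequency and their normalized code lengths; it is an assumption-based software reconstruction that reproduces the table values of example (a), not the patented hardware procedure.
```python
def build_hsf_tables(chars_by_freq, code_lens):
    """Build the five HSF code tables from characters sorted by descending
    frequency and their normalized (canonical) code lengths.
    chars_by_freq: e.g. ["u1", "u4", "u5", "u2", "u3"]
    code_lens:     e.g. [2, 2, 2, 3, 3]  (non-decreasing)"""
    char_table = list(chars_by_freq)
    len_table = sorted(set(code_lens))            # effective code lengths
    # canonical codewords: same length -> +1; longer length -> (prev + 1) << diff
    codes, code, prev_len = [], 0, code_lens[0]
    for i, ln in enumerate(code_lens):
        if i > 0:
            code = (code + 1) << (ln - prev_len)
        codes.append(code)
        prev_len = ln
    range_table = [max(i for i, ln in enumerate(code_lens) if ln == L)
                   for L in len_table]            # index of last codeword per length
    limit_table = [max(c for c, ln in zip(codes, code_lens) if ln == L)
                   for L in len_table]            # value of last codeword per length
    base_table = [lim - rng for lim, rng in zip(limit_table, range_table)]
    return char_table, len_table, range_table, limit_table, base_table

tables = build_hsf_tables(["u1", "u4", "u5", "u2", "u3"], [2, 2, 2, 3, 3])
print(tables)   # (..., [2, 3], [2, 4], [2, 7], [0, 3]) as in example (a)
```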
On this basis, the process of generating the HSF code for character u4 from the code tables of example (a) can be divided into the following three steps:
Lookup: access CharTable and obtain the rank of u4, i.e. rank(u4) = 1;
Compare: access RangeTable and find the index of the first entry greater than or equal to rank(u4); since rank(u4) ≤ 2, index = 0;
Add: access BaseTable and LenTable, obtain the base value base and code length len corresponding to the index; base and rank are added to give the numerical value of the codeword, which together with len yields the HSF code of character u4: base(u4) = 0, len(u4) = 2, code(u4) = 0 + 1 = 1, so the final encoding result is 01.
Correspondingly, the decoding flow that parses the first character u4 from an HSF stream such as 01xxx can be derived:
Compare: access LimitTable and LenTable, traversing with an index accumulated from 0, until limit ≥ the value of the first len bits of the code stream; since the entry at index 0 of LimitTable is ≥ the first 2 bits of the code stream, index = 0, limit = 2, len = 2, code = 1;
Sub: access BaseTable and subtract the base value corresponding to index from the code value, i.e. rank = 1 − 0 = 1;
Lookup: access CharTable and obtain the character corresponding to rank; the final decoding result is therefore the entry of rank 1 in CharTable, namely character u4.
The above is an introduction to the normalized Huffman coding algorithm. HSF coding simplifies the storage and operation structure of Huffman coding, and at the same time the encoding and decoding processes naturally divide into 3-stage pipelines, providing an efficient and reasonable implementation scheme for hardware deployment.
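To make the three-step encoding and decoding flows concrete, the sketch below models them in Python using the tables of example (a) exactly as listed above; it is an illustrative software model, not the hardware pipeline itself.
```python
# Code tables of example (a), taken directly from the text.
CHAR_TABLE  = ["u1", "u4", "u5", "u2", "u3"]
LEN_TABLE   = [2, 3]
RANGE_TABLE = [2, 4]
LIMIT_TABLE = [2, 7]
BASE_TABLE  = [0, 3]

def hsf_encode(ch):
    """Lookup -> Compare -> Add, as in the three-step encoding flow."""
    rank = CHAR_TABLE.index(ch)                                   # Lookup
    index = next(i for i, r in enumerate(RANGE_TABLE) if r >= rank)  # Compare
    code = BASE_TABLE[index] + rank                               # Add
    return code, LEN_TABLE[index]

def hsf_decode(bitstring):
    """Compare -> Sub -> Lookup on the leading bits of an HSF stream."""
    for index, limit in enumerate(LIMIT_TABLE):                   # Compare
        length = LEN_TABLE[index]
        code = int(bitstring[:length], 2)
        if limit >= code:
            rank = code - BASE_TABLE[index]                       # Sub
            return CHAR_TABLE[rank], length                       # Lookup

code, length = hsf_encode("u4")
assert format(code, f"0{length}b") == "01"       # u4 -> 01
assert hsf_decode("01101")[0] == "u4"            # 01xxx -> u4
```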
Multi-data-stream parallel compression method. Combining the run-length all-zero coding and the normalized Huffman coding above, the algorithmic basis of the hybrid coding proposed by the invention is in place; next we need to discuss how to improve the compression efficiency of this coding method on hardware.
In the context of parallel computing, most serial operations use pipeline techniques to achieve temporal parallelism. On the other hand, with the development of multi-core architectures, data-level parallel technologies such as Single Instruction Multiple Data (SIMD) streams have also begun to be used in various data compression methods. Following a similar approach, a parallel compression method based on data blocking and pipelined encoding/decoding is presented herein, and its hardware implementation is considered and optimized.
Pipeline stage partitioning and data blocking. Based on an estimate of the complexity of each serial operation, the existing compression method can be divided into a 5-stage pipeline in hardware, as shown in Table 2. Because of the symmetry between the encoding and decoding operations, the compression and decompression stages can also share a part of the operation and storage structures, thereby improving the utilization of the hardware units.
Table 2 codec pipeline level partitioning
On the basis of the pipelined codec structure, the fixed-size batch (Chunk) of data to be compressed needs to be divided into Blocks, as shown in fig. 11. At the same time, a data Stream for compression or decompression is allocated to each Block; each data Stream contains the hardware units required by the 5-stage pipeline, and all the data Blocks and coding/decoding Streams together complete the parallel compression and decompression operations proposed herein.
In order to fully consider the various influencing factors in the subsequent experiments and thereby evaluate the performance of the parallel compression and decompression method more scientifically, the independent variables in the codec structure need to be defined in advance, and the basic granularities of input and output are named on this basis.
Stream: from the standpoint of increasing processing speed, more data streams are of course desirable, but this multiplies the area overhead of the codec, so the number of Streams is a parameter that must be weighed through experimental analysis.
Chunk, i.e. the batch size of the data to be compressed. The most intuitive choice is to take all data to be compressed as one Chunk, but then higher-level data parallelism cannot be obtained by replicating multiple codec structures (e.g. multi-core processing structures), and simply adding more Streams does not yield a better compression effect because of hardware limitations and data locality; the Chunk size is therefore taken as one of the controlled variables in the experiments.
Dword, i.e. the input queue width of each Stream, which together with the number of Streams determines the Block size. A value comparable to the width of the data type to be compressed is usually selected, to avoid a performance bottleneck or waste of hardware resources in the stage-1 pipeline caused by the serial nature of run-length encoding.
Cword, the output queue width of each Stream; a reasonable value can be estimated by dividing the Dword width by the expected compression ratio.
Fig. 12 shows a specific codec structure, where Dword is 16 bits and Cword is 4 bits, and there are 64 Streams in total, i.e. a 1024-bit Block can be processed in parallel to obtain 256 bits of compressed data. Notably, the Streams are synchronized at the input and output and asynchronous internally, to ensure operation synchronization at Block granularity, which makes it convenient for subsequent hardware units to store, at a fixed data size, the compressed data generated simultaneously by multiple Streams in each cycle.
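As a quick consistency check of these parameters, the arithmetic below merely reproduces the numbers of the Fig. 12 configuration; the derived ratio is only the value implied by this Dword/Cword choice, not a claimed compression ratio.
```python
# Configuration of the codec structure in Fig. 12 (values from the text).
NUM_STREAMS = 64
DWORD_BITS  = 16      # input queue width of each Stream
CWORD_BITS  = 4       # output queue width of each Stream

block_bits      = NUM_STREAMS * DWORD_BITS   # 1024-bit Block processed in parallel
compressed_bits = NUM_STREAMS * CWORD_BITS   # 256 bits of compressed output per Block
implied_ratio   = block_bits / compressed_bits
print(block_bits, compressed_bits, implied_ratio)   # 1024 256 4.0
```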
To solve the aforementioned problem that a Chunk cannot necessarily be divided evenly into Blocks, the pipeline stages in each Stream need a 1-bit Last flag in addition to the necessary encoding and decoding hardware, allowing a part of the Streams to drain their pipelines early and end the compression process under the input-output synchronization constraint; when the Last bit at the output ports of all Streams is 1, the compression of the current Block is complete, as shown in fig. 13.
In addition, since a variable-length coding method is used, i.e. the code length produced by compression is not fixed, the codeword of a character with a small occurrence probability may exceed its own data bit width, and in extreme cases the compressed data size may exceed the size of the Block, causing compression to fail. A Bypass design should therefore also be incorporated around the codec structure described above, i.e. an alternative operation for when compression of the current Block fails.
Specifically, if the compressed data obtained at the output exceeds the size of the current Block, the compression is abandoned, the original data is used directly as the compression result of the current Block, and the operations of each pipeline stage are skipped in the decompression phase. The next section introduces a storage format that enables the compression Bypass as well as parameterization of the pipeline.
Head-Data storage format. In order to implement the Bypass operation and other more flexible encoding/decoding modes, whether the operations in each pipeline stage are enabled is configured through parameters, so that the role played by each stage in the whole scheme can be conveniently evaluated while improving compression efficiency. A Head-Data storage format for the compressed Data is therefore proposed herein: the compressed Data is divided into a Head and a Data portion, where the Head stores the address offset, data length and compression configuration (for example the enable of each stage, the Bypass enable, etc.) of the Data portion, as shown in fig. 14.
On this basis, instruction-level compression and decompression require an additional configuration phase: before the codec performs the encoding operation, the signals required by each pipeline stage, such as the code tables (CharTable, etc.) and the stage enables, are loaded; during decoding, the stage-enable and Bypass-enable signals are read from the Head portion of the compressed Data, so that the correct decoding flow is executed on the Data portion and the decompression operation is completed.
In addition, because the Data portion is located via the address offset, the Head-Data storage format also guarantees the correctness of padding the output Data to 256 bits after part of the Streams in fig. 13 finish compression early: the compressed Data corresponding to each Block has an exact storage address, so appending some meaningless data at the end of the Data portion does not affect the correct execution of the decompression process.
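As an illustration of the Head-Data format, a possible in-memory model is sketched below in Python. The field names follow the description above (address offset, data length, compression configuration, Bypass); the concrete types, default values and the toy compressor in the usage line are assumptions for illustration only.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class Head:
    """Header of one compressed Block (field widths are illustrative)."""
    addr_offset: int              # locates the Data portion of this Block
    data_length: int              # length of the Data portion, in bits
    stage_enable: int = 0b11111   # enable flag for each of the 5 pipeline stages
    bypass: bool = False          # set when compression of this Block was abandoned

@dataclass
class CompressedBlock:
    head: Head
    data: bytes                   # compressed Data, or the raw Block when bypass is set

def pack_chunk(blocks: List[bytes], compress) -> List[CompressedBlock]:
    """Pack each Block; fall back to Bypass when compression does not pay off."""
    out, offset = [], 0
    for raw in blocks:
        comp = compress(raw)
        bypass = len(comp) >= len(raw)        # compressed result would exceed the Block
        payload = raw if bypass else comp
        out.append(CompressedBlock(Head(offset, len(payload) * 8, bypass=bypass), payload))
        offset += len(payload)
    return out

# Toy usage: a "compressor" that strips trailing zero bytes.
packed = pack_chunk([bytes(16), b"\x01" * 16],
                    lambda raw: raw.rstrip(b"\x00") or b"\x00")
print([(b.head.bypass, b.head.addr_offset) for b in packed])
```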
FakeLiteral and pipeline balancing. Because the codec structure has the characteristics of fixed-length input, variable-length output, port synchronization and internal asynchrony, the synchronization constraint of the input/output queues may cause a deadlock when the compression speeds of the multiple Streams are extremely unbalanced.
First it should be made clear that the inputs and outputs of all Streams proceed synchronously, i.e. input occurs only when all first-stage pipelines are empty and output occurs only when all last-stage pipelines are full, while the intermediate stages execute independently and wait only when the previous stage is empty or the next stage is full. Without loss of generality, assume an example in which the total number of Streams is 2, the Dword width is 8, the Cword width is 8 and the FIFO depth is 4; fig. 15 shows how the deadlock phenomenon arises.
To simulate the extreme case of imbalance between the Streams, assume that Stream1 encodes each 8-bit datum into 1 bit and Stream2 encodes each 8-bit datum into 16 bits. After the inputs of the first 3 Cycles the pipelines are full, and output begins at the 4th Cycle; at the 5th Cycle the 4 entries of Stream2's output FIFO are full, but since Stream1's output FIFO currently holds only 2 bits, not enough to provide a complete Cword, Stream2 is blocked from this moment and none of its pipeline stages can accept input. After the 8th Cycle all pipeline stages in Stream1 have been drained, but its output FIFO holds only 5 bits, still not enough to output, and the input side cannot feed it because it is blocked by Stream2's first pipeline stage; the system thus enters a deadlock state.
On inspection it can be seen that by the time the deadlock occurs, Stream1 has completed the encoding of 5 Dwords while Stream2 has completed only 2, and the 3 Dwords of difference are temporarily held as intermediate data in Stream2's pipeline stages, which causes the whole pipeline to block. Guided by this conclusion, bubbles can be inserted at appropriate times into the Stream with the fastest encoding speed, reducing the gap between the pipelines and thereby avoiding deadlock.
To address this problem, the concept of FakeLiteral, i.e. a "virtual character", is introduced herein: it is input as a placeholder bubble during encoding and is directly discarded during decoding, never being treated as valid data. Similar to the special character ZeroLiteral introduced in section 3.2, FakeLiteral is also represented by a character with an extremely low frequency of occurrence in the data to be encoded, and is likewise distinguished from real characters by the strategy of second-order substitution plus an Extra bit, as shown in fig. 16 (253 and 252 for FakeLiteral and FakeExtra, respectively); the identification and discarding of FakeLiteral in the decoding stage can also be merged into the Alter stage.
From the previous derivation it can be seen that the condition for deadlock is the following inequality, where P_current is the number of Dwords already encoded by the current Stream, P_min is the number of Dwords already encoded by the Stream with the slowest encoding rate, and S is the pipeline depth:
P_current − P_min ≥ S
For the example just given, P1 = 5, P2 = 2 and S = 3 at the 8th Cycle, so the condition for writing FakeCode into the output FIFO is satisfied; with FakeLiteral insertion, the deadlock problem of this example is thus solved.
From the above analysis it can be seen that the P value of each Stream actually corresponds to its number of inputs. When the deadlock condition is satisfied, several FakeCodes, the codes corresponding to FakeLiteral, are written into the output FIFO of the current Stream so that a complete Cword output can be produced. To reduce the performance loss caused by deadlock handling, the FakeCode can be generated in advance and saved as configuration information of the codec in the output FIFO of each Stream, thereby omitting the repeated encoding of FakeLiteral.
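For illustration, the deadlock-avoidance rule can be modelled as a small monitor in Python: it evaluates P_current − P_min ≥ S for every Stream and, when the condition holds, writes pre-generated FakeCode into that Stream's output FIFO. The queue representation and the amount written are simplified assumptions based on the description above.
```python
PIPELINE_DEPTH = 3   # S in the deadlock condition

def balance_streams(encoded_counts, output_fifos, fake_code):
    """encoded_counts[i]: number of Dwords already encoded by Stream i (P_i).
    When P_current - P_min >= S, insert FakeCode bubbles into that Stream's
    output FIFO so that the gap between the pipelines shrinks."""
    p_min = min(encoded_counts)
    for i, p_current in enumerate(encoded_counts):
        gap = p_current - p_min
        if gap >= PIPELINE_DEPTH:
            # pre-generated FakeCode is appended; decoders discard it on sight
            output_fifos[i].extend([fake_code] * gap)

# Example from the text: at Cycle 8, Stream1 has encoded 5 Dwords, Stream2 only 2.
fifos = [[], []]
balance_streams([5, 2], fifos, fake_code="FAKE")
print(fifos)   # Stream1's FIFO receives FakeCode, Stream2's is untouched
```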
Shuffle method. From the discussion of the pipeline balancing method in the previous section, we have seen that the input data of a single Stream may contain many consecutive high-frequency characters, causing the code rates of different Streams to differ too much and thereby, under the synchronization constraint, causing a large amount of blocking at the input and output FIFOs. Although the introduction of FakeLiteral solves the deadlock problem, it does not remove the performance waste caused by pipeline blocking.
To address this problem, a method of grouping the Streams and shuffling the input data is proposed herein: two consecutive inputs may now come from two Streams different from the original one, which increases the input randomness of a single Stream, further balances the coding rate of each Stream, and improves the utilization efficiency of hardware resources.
Through qualitative-analysis experiments, when the number of Streams is an integer multiple of 16, the following 4 Shuffle modes perform best:
adj4, the adjacent 4 streams are divided into a group, cyclic shuffling in the group is carried out, dword input sequentially by Stream1 is respectively input to stream1, 2, 3 and 4 before shuffling, and so on.
Adj16: adjacent groups of 16 Streams perform cyclic shuffling within the group; the Dwords sequentially input to Stream1 before shuffling are now input to Streams 1, 2, 3, ..., 16 respectively, and so on.
Every4: Streams are grouped with a stride of 4 and cyclic shuffling is performed within the group; the Dwords sequentially input to Stream1 before shuffling are now input to Streams 1, 5, 9 and 13 respectively, and so on.
Every16: Streams are grouped with a stride of 16 and cyclic shuffling is performed within the group; the Dwords sequentially input to Stream1 before shuffling are now input to Streams 1, 17, 33 and 49 respectively, and so on (see the index-mapping sketch below).
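One possible way to express the four Shuffle modes is as an index mapping from (original Stream, input step) to the Stream that actually receives the Dword. The formulas below are an interpretation of the description (Streams numbered from 0) given for illustration only; the patented routing may differ in detail.
```python
def shuffle_target(stream, step, num_streams, mode):
    """Return the stream that receives the `step`-th Dword originally destined
    for `stream` (0-indexed), under the four Shuffle modes."""
    if mode == "adj4":        # adjacent groups of 4, cyclic within the group
        base = (stream // 4) * 4
        return base + (stream - base + step) % 4
    if mode == "adj16":       # adjacent groups of 16
        base = (stream // 16) * 16
        return base + (stream - base + step) % 16
    if mode == "every4":      # strided grouping, stride 4
        return (stream + 4 * step) % num_streams
    if mode == "every16":     # strided grouping, stride 16
        return (stream + 16 * step) % num_streams
    return stream             # no shuffling

# Stream 1 in the text corresponds to index 0 here.
print([shuffle_target(0, t, 64, "adj4") for t in range(4)])      # [0, 1, 2, 3]
print([shuffle_target(0, t, 64, "every16") for t in range(4)])   # [0, 16, 32, 48]
```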
This chapter has completed the introduction of the algorithmic scheme of the lossless compression method. Reviewing the evaluation indexes of neural network compression listed in section 2.5, the following conclusions can be drawn: the hybrid coding is a lossless compression algorithm, any encoded value in the neural network can be restored, and a reduction of the neural network model accuracy (Accuracy) is avoided in principle; for original data with a fixed data bit width and size, the bit rate (BitRate) is closely related to the overall compression effect, as shown in the following formula.
BitRate=DataWidth×CompressionRatio
The experiments and analyses of the next chapter will therefore be discussed mainly in terms of the compression ratio, quantifying the compression effect of this hybrid coding in the form of charts, thereby demonstrating its advantages in neural network computation and its optimization of neural network compression.
The following is a system example corresponding to the above method example, and this embodiment mode may be implemented in cooperation with the above embodiment mode. The related technical details mentioned in the above embodiments are still valid in this embodiment, and in order to reduce repetition, they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a neural network quantization compression system based on the data shuffling among streams, which comprises:
the module 1 is used for acquiring the neural network data to be compressed after quantization processing, and partitioning the neural network data to obtain a plurality of data blocks;
a module 2, configured to allocate a data stream to each data block for compression, where the compression in each data stream includes: performing run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding performs run-length coding only on the zero characters in the neural network data; performing normalized Huffman coding on the run-length compressed data; and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
a module 3, configured to aggregate compression results of the data blocks as compression results of the neural network data;
wherein the module 2 comprises: setting input caches with the same number as the data streams, wherein each data stream is provided with an output cache for caching the compression result of the data block; the data stream randomly selects an input buffer or selects an input buffer according to a preset rule and obtains the data block therefrom.
The neural network quantization compression system based on the data shuffling among streams is characterized in that each data stream is provided with an independent input buffer memory and an output buffer memory, the input buffer memory is used for buffering the data block, and the output buffer memory is used for buffering the compression result of the data block.
The neural network quantization compression system based on the data shuffling between streams, wherein the module 2 comprises:
a module 21, for monitoring the amount of data each data stream has compression-encoded to determine whether P_current − P_min ≥ S is satisfied; if yes, writing the virtual codes corresponding to virtual characters into the output buffer of the current data stream, otherwise executing module 21 again; wherein P_current is the number of input queue widths already encoded by the current data stream, P_min is the number of input queue widths already encoded by the data stream with the slowest encoding rate, and S is the pipeline depth of the data stream.
The neural network quantization compression system based on inter-stream data shuffling, wherein the amount of virtual codes written into the output buffer of the current data stream is P_current − P_min.
The neural network quantization compression system based on the data shuffling between streams, wherein the module 21 comprises: writing virtual characters into an output buffer of the current data stream to obtain third intermediate data; judging whether the character identical to the virtual character in the third intermediate data is the original character in the output buffer of the current data stream, if so, replacing the character identical to the virtual character in the third intermediate data with the virtual code, and adding a flag bit representing that the character is the original character at the back, otherwise, replacing the character identical to the virtual character in the third intermediate data with the virtual code, and adding a flag bit representing that the character is the replacement character at the back.
As shown in fig. 17, the present invention further provides a neural network compression system based on hybrid coding, which includes:
the method comprises the steps of 1, acquiring neural network data to be compressed after quantization processing, and partitioning the neural network data to obtain a plurality of data blocks;
a module 2, assigning a data stream to each data block for compression, the compression in each data stream comprising: performing run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding performs run-length coding only on the zero characters in the neural network data; performing normalized Huffman coding on the run-length compressed data; and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
and a module 3, collecting the compression result of each data block as the compression result of the neural network data.
The neural network compression system based on hybrid coding, wherein the input and output of all data streams in the module 2 are synchronously performed.
The neural network compression system based on hybrid coding, wherein each data stream is provided with an independent input buffer and an independent output buffer, the input buffer being used for caching the data block and the output buffer for caching the compression result of the data block.
The hybrid coding-based neural network compression system, wherein the module 2 comprises:
a module 21, configured to monitor the amount of data each data stream has compression-coded and to judge whether P_current - P_min ≥ S holds; if yes, writing virtual codes corresponding to the virtual characters into the output buffer of the current data stream; otherwise, executing the module 21 again; wherein P_current is the already-encoded input-queue width of the current data stream, P_min is the already-encoded input-queue width of the data stream with the slowest encoding rate, and S is the pipeline depth of the data stream.
The neural network compression system based on hybrid coding, wherein the amount of virtual codes written into the output buffer of the current data stream is P_current - P_min.
The hybrid coding-based neural network compression system, wherein the module 21 comprises: writing virtual characters into the output buffer of the current data stream to obtain third intermediate data; and judging, for each character in the third intermediate data that is identical to the virtual character, whether it is an original character of the output buffer of the current data stream; if so, replacing it with the virtual code and appending a flag bit indicating an original character; otherwise, replacing it with the virtual code and appending a flag bit indicating a replacement character.
The hybrid coding-based neural network compression system, wherein the module 2 comprises: setting as many input buffers as there are data streams for caching the data blocks, each data stream being provided with an output buffer for caching the compression result of the data block.
The hybrid coding-based neural network compression system, wherein the module 2 comprises: the data stream randomly selects an input buffer and retrieves the data block therefrom.
The hybrid coding-based neural network compression system, wherein the module 2 comprises: the data stream selects an input buffer according to a preset rule and obtains the data block therefrom.
The hybrid coding-based neural network compression system, wherein the run bit width of the all-zero run-length code is 2 bits.
The hybrid coding-based neural network compression system, wherein the all-zero run-length coding further comprises: performing run-length coding on the zero data in the neural network data to obtain first intermediate data; replacing each coding segment of the first intermediate data whose run length is 3 with a zeroLiteral character to obtain second intermediate data; and judging, for each character in the second intermediate data that is identical to the zeroLiteral character, whether it is an original character of the neural network data; if so, replacing it with the zeroExtra character and appending a flag bit indicating an original character; otherwise, replacing it with the zeroExtra character and appending a flag bit indicating a replacement character.
The neural network compression system based on hybrid coding, wherein the zeroLiteral character and the zeroExtra character are the two characters with the lowest occurrence frequency in the neural network data.
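A minimal sketch of one plausible reading of the all-zero run-length step above; the translated flag-bit convention is ambiguous, so the escaping rule below (collisions with zeroLiteral or zeroExtra are escaped through zeroExtra plus a one-bit flag) is an assumption rather than the exact claimed procedure. Runs of zeros are limited to 3 by the 2-bit run field, and a full run of 3 is replaced by the zeroLiteral symbol.

from collections import Counter

FLAG_ORIGINAL, FLAG_REPLACEMENT = 0, 1

def pick_escape_symbols(data: bytes):
    """zeroLiteral / zeroExtra: the two symbols with the lowest frequency in the data."""
    counts = Counter({s: 0 for s in range(256)})
    counts.update(data)
    (zero_literal, _), (zero_extra, _) = counts.most_common()[:-3:-1]
    return zero_literal, zero_extra

def zero_run_length_encode(data: bytes, zero_literal: int, zero_extra: int) -> list:
    out = []  # output symbols, optionally tagged with a one-bit flag
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while run < 3 and i + run < len(data) and data[i + run] == 0:
                run += 1
            # a full run of 3 zeros becomes the zeroLiteral escape symbol;
            # shorter runs keep an explicit 2-bit run count
            out.append((zero_literal,) if run == 3 else (0, run))
            i += run
        else:
            sym = data[i]
            if sym == zero_literal:
                out.append((zero_extra, FLAG_ORIGINAL))      # original byte colliding with zeroLiteral
            elif sym == zero_extra:
                out.append((zero_extra, FLAG_REPLACEMENT))   # original byte colliding with zeroExtra
            else:
                out.append((sym,))
            i += 1
    return out

data = bytes([9, 0, 0, 0, 0, 0, 7, 0, 3])
zl, ze = pick_escape_symbols(data)     # in sparse quantized data, 0 stays out of the escape pair
print(zero_run_length_encode(data, zl, ze))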
The neural network compression system based on hybrid coding, wherein the coding result is a Huffman tree.
The hybrid coding-based neural network compression system, wherein the module 2 comprises:
and reforming the Huffman tree from top to bottom, namely moving the leaf nodes among the nodes of the same level of the Huffman tree to the left side of the binary tree.
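After the leaves at every level are packed to the left, only the code length of each symbol needs to be stored, and the codes can be regenerated level by level in a fixed order. The following sketch shows this standard canonical-Huffman code assignment from code lengths; it illustrates the idea behind the reformation rather than the exact procedure of the embodiment.

def canonical_codes(code_lengths: dict) -> dict:
    """code_lengths maps symbol -> Huffman code length (depth of its leaf)."""
    # sorting by (length, symbol) corresponds to moving leaves to the left at each level
    ordered = sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0]))
    codes, code, prev_len = {}, 0, 0
    for symbol, length in ordered:
        code <<= (length - prev_len)              # descend to the next tree level if needed
        codes[symbol] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

# e.g. code lengths taken from an ordinary Huffman tree over four symbols
print(canonical_codes({5: 1, 0: 2, 7: 3, 9: 3}))
# {5: '0', 0: '10', 7: '110', 9: '111'}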
The neural network compression system based on hybrid coding, wherein the neural network data are weight data and neuron input data in the neural network operation.
The hybrid coding-based neural network compression system, wherein, if the canonical Huffman code is not smaller in size than the neural network data, the canonical Huffman code is abandoned and the neural network data is directly used as the compression result.
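A small sketch of the fallback rule above, with an assumed one-byte marker so a decoder can tell raw blocks from coded ones (the marker layout is illustrative, not specified by the embodiment).

RAW, ENCODED = b"\x00", b"\x01"      # assumed block-type markers

def finalize_block(raw_block: bytes, encoded_block: bytes) -> bytes:
    if len(encoded_block) >= len(raw_block):
        return RAW + raw_block       # code abandoned; the original data is the result
    return ENCODED + encoded_block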

Claims (2)

1. The neural network quantization compression method based on the inter-stream data shuffling is characterized by comprising the following steps:
step 1, acquiring the neural network data to be compressed after quantization processing, and partitioning the neural network data to obtain a plurality of data blocks;
Step 2, allocating a data stream to each data block for compression, wherein the compression of each data stream comprises the following steps: performing all-zero run-length coding on the data block to obtain run-length compressed data, wherein the all-zero run-length coding performs run-length coding only on the zero characters in the neural network data; performing canonical Huffman coding on the run-length compressed data; and reforming the coding result to obtain the canonical Huffman code as the compression result of the data block;
step 3, collecting compression results of all data blocks to be used as compression results of the neural network data;
wherein this step 2 comprises: setting as many input buffers as there are data streams, each data stream being provided with an output buffer for caching the compression result of the data block; the data stream randomly selects an input buffer, or selects an input buffer according to a preset rule, and obtains the data block from the input buffer;
wherein the run bit width of the all-zero run-length code is 2 bits;
the all-zero run-length coding further comprises: performing run-length coding on the zero data in the neural network data to obtain first intermediate data; replacing each coding segment of the first intermediate data whose run length is 3 with a zeroLiteral character to obtain second intermediate data; and judging, for each character in the second intermediate data that is identical to the zeroLiteral character, whether it is an original character of the neural network data; if so, replacing it with the zeroExtra character and appending a flag bit indicating an original character; otherwise, replacing it with the zeroExtra character and appending a flag bit indicating a replacement character;
The coding result is a Huffman tree; the step 2 comprises the following steps:
reforming the Huffman tree from top to bottom, specifically, moving the leaf nodes among the nodes of the same level of the Huffman tree to the left side of the binary tree;
each data stream is provided with an independent input buffer memory and an output buffer memory, wherein the input buffer memory is used for buffering the data block, and the output buffer memory is used for buffering the compression result of the data block;
the step 2 comprises the following steps:
step 21, monitoring the amount of data each data stream has compression-coded, and judging whether P_current - P_min ≥ S holds; if yes, writing virtual codes corresponding to the virtual characters into the output buffer of the current data stream; otherwise, executing step 21 again; wherein P_current is the encoded input-queue width of the current data stream, P_min is the encoded input-queue width of the data stream with the slowest coding rate, and S is the pipeline depth of the data stream;
the amount of virtual codes written into the output buffer of the current data stream is P_current - P_min;
The step 21 comprises: writing virtual characters into the output buffer of the current data stream to obtain third intermediate data; and judging, for each character in the third intermediate data that is identical to the virtual character, whether it is an original character of the output buffer of the current data stream; if so, replacing it with the virtual code and appending a flag bit indicating an original character; otherwise, replacing it with the virtual code and appending a flag bit indicating a replacement character.
2. A neural network quantization compression system based on inter-stream data shuffling, comprising:
the module 1 is used for acquiring the neural network data to be compressed after quantization processing, and partitioning the neural network data to obtain a plurality of data blocks;
a module 2, configured to allocate a data stream to each data block for compression, where the compression of each data stream includes: performing all-zero run-length coding on the data block to obtain run-length compressed data, wherein the all-zero run-length coding performs run-length coding only on the zero characters in the neural network data; performing canonical Huffman coding on the run-length compressed data; and reforming the coding result to obtain the canonical Huffman code as the compression result of the data block;
a module 3, configured to aggregate compression results of the data blocks as compression results of the neural network data;
wherein the module 2 comprises: setting as many input buffers as there are data streams, each data stream being provided with an output buffer for caching the compression result of the data block; the data stream randomly selects an input buffer, or selects an input buffer according to a preset rule, and obtains the data block from the input buffer;
wherein the run bit width of the all-zero run-length code is 2 bits;
the all-zero run-length coding further comprises: performing run-length coding on the zero data in the neural network data to obtain first intermediate data; replacing each coding segment of the first intermediate data whose run length is 3 with a zeroLiteral character to obtain second intermediate data; and judging, for each character in the second intermediate data that is identical to the zeroLiteral character, whether it is an original character of the neural network data; if so, replacing it with the zeroExtra character and appending a flag bit indicating an original character; otherwise, replacing it with the zeroExtra character and appending a flag bit indicating a replacement character;
the coding result is a Huffman tree; the module 2 comprises:
reforming the Huffman tree from top to bottom, specifically, moving the leaf nodes among the nodes of the same level of the Huffman tree to the left side of the binary tree;
each data stream is provided with an independent input buffer memory and an output buffer memory, wherein the input buffer memory is used for buffering the data block, and the output buffer memory is used for buffering the compression result of the data block;
a module 21, configured to monitor the amount of data each data stream has compression-coded and to judge whether P_current - P_min ≥ S holds; if yes, writing virtual codes corresponding to the virtual characters into the output buffer of the current data stream; otherwise, executing the module 21 again; wherein P_current is the encoded input-queue width of the current data stream, P_min is the encoded input-queue width of the data stream with the slowest coding rate, and S is the pipeline depth of the data stream;
the amount of virtual codes written into the output buffer of the current data stream is P_current - P_min;
The module 21 comprises: writing virtual characters into the output buffer of the current data stream to obtain third intermediate data; and judging, for each character in the third intermediate data that is identical to the virtual character, whether it is an original character of the output buffer of the current data stream; if so, replacing it with the virtual code and appending a flag bit indicating an original character; otherwise, replacing it with the virtual code and appending a flag bit indicating a replacement character.
CN202011607729.7A 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling Active CN114697673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607729.7A CN114697673B (en) 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607729.7A CN114697673B (en) 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling

Publications (2)

Publication Number Publication Date
CN114697673A CN114697673A (en) 2022-07-01
CN114697673B true CN114697673B (en) 2023-06-27

Family

ID=82132891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607729.7A Active CN114697673B (en) 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling

Country Status (1)

Country Link
CN (1) CN114697673B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111384962B (en) * 2018-12-28 2022-08-09 上海寒武纪信息科技有限公司 Data compression/decompression device and data compression method
CN111382849B (en) * 2018-12-28 2022-11-22 上海寒武纪信息科技有限公司 Data compression method, processor, data compression device and storage medium
CN111384963B (en) * 2018-12-28 2022-07-12 上海寒武纪信息科技有限公司 Data compression/decompression device and data decompression method
CN109859281B (en) * 2019-01-25 2022-12-02 杭州国芯科技股份有限公司 Compression coding method of sparse neural network
CN111726634B (en) * 2020-07-01 2022-08-05 成都傅立叶电子科技有限公司 High-resolution video image compression transmission method and system based on FPGA

Also Published As

Publication number Publication date
CN114697673A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN114697672B (en) Neural network quantization compression method and system based on run Cheng Quanling coding
CN114697654B (en) Neural network quantization compression method and system
Goyal et al. Deepzip: Lossless data compression using recurrent neural networks
CN1183683C (en) Position adaptive coding method using prefix prediction
CN101771879B (en) Parallel normalized coding realization circuit based on CABAC and coding method
CN106549673B (en) Data compression method and device
CN107565971B (en) Data compression method and device
CN116681036B (en) Industrial data storage method based on digital twinning
CN105306951B (en) The pipeline parallel method accelerating method and system framework of data compression coding
CN110021369B (en) Gene sequencing data compression and decompression method, system and computer readable medium
CN109871362A (en) A kind of data compression method towards streaming time series data
WO2016160331A1 (en) Scalable high-bandwidth architecture for lossless compression
CN111507465B (en) Configurable convolutional neural network processor circuit
US9137336B1 (en) Data compression techniques
WO2019076177A1 (en) Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium
CN111028897B (en) Hadoop-based distributed parallel computing method for genome index construction
CN112488305A (en) Neural network storage organization structure and configurable management method thereof
CN114157305A (en) Method for rapidly realizing GZIP compression based on hardware and application thereof
CN114697673B (en) Neural network quantization compression method and system based on inter-stream data shuffling
CN113268459A (en) Batch distributed compression method based on FASTQ gene big data
CN114697655B (en) Neural network quantization compression method and system for equalizing compression speed between streams
CN116318172A (en) Design simulation software data self-adaptive compression method
CN113902097A (en) Run-length coding accelerator and method for sparse CNN neural network model
Mao et al. A Fast Transformer-based General-Purpose Lossless Compressor
GB2607923A (en) Power-aware transmission of quantum control signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant