CN114697673A - Neural network quantization compression method and system based on inter-stream data shuffling - Google Patents


Info

Publication number
CN114697673A
Authority
CN
China
Prior art keywords
data
neural network
compression
character
stream
Prior art date
Legal status
Granted
Application number
CN202011607729.7A
Other languages
Chinese (zh)
Other versions
CN114697673B (en)
Inventor
何皓源
王秉睿
支天
郭崎
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202011607729.7A
Publication of CN114697673A
Application granted
Publication of CN114697673B
Current legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a neural network quantization compression method and system based on inter-stream data shuffling. The method comprises: obtaining quantized neural network data to be compressed and partitioning it into a plurality of data blocks; allocating a data stream to each data block for compression, wherein each data stream selects an input buffer either randomly or according to a preset rule to obtain the data block to be compressed; and aggregating the compression results of the data blocks as the compression result of the neural network data. By avoiding continuous reading from the same input buffer, the invention increases the input randomness of a single data stream and balances the coding rates of the data streams, thereby improving the utilization efficiency of hardware resources.

Description

Neural network quantization compression method and system based on inter-stream data shuffling
Technical Field
The invention relates to the field of neural network operation, in particular to a neural network quantization compression method and system based on inter-stream data shuffling.
Background
In recent years, against the background of explosive growth in both information content and hardware computing power, the rapid development of artificial intelligence has become a major driving force for productivity and technological innovation. As a main branch of artificial intelligence, neural network algorithms have, in the pursuit of higher model accuracy, run into the technical bottlenecks of complex structures and large parameter and computation volumes, which limits the application of neural network models in scenarios that pursue throughput and energy efficiency; computational efficiency has therefore become the main research target for the next stage. The most effective neural network compression methods combine low precision and sparsity and can reduce the parameter count of a neural network to a certain extent, but they cannot further mine the data redundancy in a pruned and quantized neural network model.
Pruning, as shown in fig. 1, may also be called model sparsification: a portion of the less important parameters in the neural network is directly pruned to 0, and the computations related to those parameters are masked as well. According to the basic unit of pruning, it can be divided into single-parameter pruning and structured pruning.
Quantization, as shown in fig. 2, may also be called low-bit model quantization: a low-precision representation reduces the number of bits required to store part or all of the data in a neural network model, thereby reducing the parameter size and simplifying most floating-point operations, so as to achieve model compression.
The rationality of network quantization comes from the data distribution characteristics of the neural network itself: for example, output neurons that pass through an activation layer may contain many 0s, and the absolute values of most weights fall within the [0,1] interval, so large amounts of data overflow are avoided and the accuracy of a model quantized to low precision can easily be recovered with a retraining process. On the other hand, in order to guarantee training convergence speed and accuracy, deep learning frameworks such as Caffe, TensorFlow and PyTorch mainly use 32-bit or 64-bit floating-point numbers, which is wasteful for ASIC or FPGA hardware implementations: it not only brings more storage and computation overhead, but also significantly increases the area and power consumption of the arithmetic units in a chip.
For the inference operation of most network models, low-precision data of about 8 bits can meet the network performance requirements, so industrial application of neural network quantization is already mature; Nvidia's TensorRT library and many deep learning acceleration architectures at home and abroad support and optimize 8-bit fixed-point neural network computation and storage.
Data reduction, as shown in fig. 3, is a method that replaces the original data with a small portion of data to reduce the number of parameters in a neural network model. Data reduction was originally a concept in data preprocessing; it is no longer limited to reducing the data set size of neural networks, but is gradually being used to explore compression of neural network parameters and even network structure. According to the form of the compressed data, data reduction can be further divided into two categories: dimensionality reduction and numerosity reduction.
The goal of dimensionality reduction is to reduce the number of independent variables of the learnable parameters: for example, the parameters in the model are subjected to a wavelet transform, principal component analysis or discrete cosine transform, and compression operations such as quantization or entropy coding are then performed using the transform-domain characteristics of the data; combined with methods such as pruning, quantization and retraining, the redundancy of the neural network model can be reduced from the perspective of the transform domain. The idea is similar to the treatment of the DC component in JPEG image compression and likewise exploits data locality in the neural network, but feature maps differ essentially from natural images, and a compression strategy for transform-domain parameters must be explored carefully to avoid an unrecoverable loss of model accuracy after the inverse transform, which would invalidate the whole compression method.
Numerosity reduction focuses on methods that reduce the number of parameters more directly, such as linear regression, clustering and sampling, to achieve parameter sharing within a certain range, where the sharing granularity may be parameter matrices, channels, convolution kernels, layers or even sub-networks.
Combining the existing research results, pruning, quantization and data reduction usually appear simultaneously in a comprehensive neural network compression method, which to a certain extent also gives a direction for subsequent hardware optimization. This is because, when a parameter-sharing strategy is formed, the parameters in the model must be grouped according to some evaluation index; numerosity reduction therefore has a natural coupling with the aforementioned pruning and quantization methods, and the granularity of parameter sharing can very naturally be used as the granularity of pruning and quantization.
On the other hand, traditional data compression methods, especially image compression methods (entropy coding, transform coding, run-length coding, hybrid coding and the like), are widely applied in neural network compression to perform a final compression step on network models that have already been pruned, quantized and data-reduced. This can save 20%-30% of the storage overhead and improve the compression ratio, and when deployed on a related hardware acceleration architecture it can also significantly improve the computational performance of inference and training.
In addition, neural network compression research has also extended the meaning of lossless compression: whether by pruning, quantization or data reduction, the parameters in the neural network are distorted; although retraining can restore the accuracy of the compressed model, the model data undergoes an irreversible change, which further reduces the interpretability of the neural network algorithm, so the definition of lossless compression differs from that in classical data compression. It should therefore be noted that the lossless compression mentioned herein does not refer to a neural network compression method in which the model accuracy is lossless, but to a data compression method in which the data itself can be restored without any distortion.
Data distribution after pruning and quantization. The data distribution in a neural network obeys the law of large numbers, so the data in a pre-trained model approximately follows a normal distribution with mean 0. Due to the reduction of numerical resolution, low-precision quantization turns this into a discrete normal distribution, and values beyond the representable range overflow, but the distribution envelope remains unchanged. Taking Conv5, the last convolution layer in AlexNet, as an example, fig. 4 shows the weight distribution of this layer after quantization to 8-bit fixed-point numbers. Meanwhile, pruning clips a part of the parameters whose absolute values are close to 0 to exactly 0; as shown in fig. 5, a hole appears in the distribution near 0, and the data there are all concentrated in a peak equal to 0.
Under the dual constraints of pruning and quantization, although retraining can restore the model accuracy, the normal distribution of the pre-trained model cannot be recovered; instead the model presents the bimodal distribution shown in fig. 6. From this phenomenon we can see that a compression method combining pruning, quantization and retraining not only introduces a large number of 0s but also profoundly changes the data distribution in the neural network.
Huffman coding reaches the entropy limit on character sequences whose frequency distribution conforms to powers of 2^-n, while run-length coding involves no second-order coding. Combining this with the change in data distribution, the following two points to be optimized can be identified:
(1) Compared with the original normal distribution, the frequency of occurrence of each value after retraining is more uniform, which may increase the average code length of Huffman coding; the data must therefore be pre-coded before entropy coding to raise the frequency of occurrence of the characters fed into the Huffman coder.
(2) Although retraining can recover a part of the clipped parameters, a large number of 0s still exist among the weight values, and these 0s are not necessarily contiguous, so an encoding method that can comprehensively compress these 0s is needed to increase the compression ratio of the pruned model.
A neural network compression method is not only an improvement of a neural network algorithm but also the design and application of a data compression method, so its effect must be evaluated from two aspects, accuracy and compression ratio; in addition, to measure the influence on hardware bandwidth more reasonably, the bit rate should be introduced as an evaluation index.
Accuracy. The neural network compression methods introduced above all modify the neural network, either numerically or structurally, which inevitably affects the model's performance on its target task; whether the pre-training accuracy can be restored depends on whether the damage caused by the compression algorithm is reversible. Therefore, the recognition accuracy of the neural network after compression and a series of retraining operations, especially in comparison with the pre-trained model, is the primary index for evaluating a compression method. For pattern recognition (classification), accuracy is reported as Top-1 and Top-5, while on tasks such as object detection and scene segmentation metrics such as mAP need to be considered.
Compression ratio (CompressionRatio) is the ratio of the original data size to the compressed data size; it is an intuitive evaluation of a data compression method and a common evaluation criterion for all compression algorithms.
Bit rate (BitRate) is the average number of bits required to represent one character in the compressed data; its meaning is similar to the weighted average code length in Huffman coding. A lower bit rate indicates that the current compression method fits the original data distribution better and that less hardware bandwidth is required to carry the compressed data.
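As a minimal illustration of these two indexes (not part of the patent text; the sizes below are assumed), the following Python sketch computes the compression ratio and bit rate for a block of 8-bit weights:

```python
def compression_ratio(original_bits: int, compressed_bits: int) -> float:
    """Ratio of the original data size to the compressed data size."""
    return original_bits / compressed_bits

def bit_rate(compressed_bits: int, num_characters: int) -> float:
    """Average number of bits used per original character after compression."""
    return compressed_bits / num_characters

# Assumed example: 1,000,000 8-bit weights compressed to 400,000 bytes.
original_bits = 1_000_000 * 8
compressed_bits = 400_000 * 8
print(compression_ratio(original_bits, compressed_bits))   # 2.5
print(bit_rate(compressed_bits, 1_000_000))                # 3.2 bits per weight
```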
Based on these three evaluation indexes, the invention provides a brand-new lossless compression algorithm, aiming to design, with reference to the existing technical foundation and evaluation system, a coding method better suited to neural network models, so as to substantially optimize neural network compression.
Disclosure of Invention
The invention provides a coding and decoding method for multiple parallel streams, a Shuffle method that improves the compression ratio through inter-stream data shuffling, and a FakeLiteral insertion method that avoids deadlock by balancing the compression speed between streams, and finally provides an efficient neural network lossless compression coding scheme and its hardware pipeline implementation based on normalized HSF (Huffman-Shannon-Fano) coding, run-length coding and a further all-zero replacement coding.
Specifically, the invention provides a neural network quantization compression method based on inter-stream data shuffling, which comprises the following steps:
step 1, obtaining neural network data to be compressed after quantization processing, and blocking the neural network data to obtain a plurality of data blocks;
step 2, allocating a data stream to each data block for compression, wherein the compression performed by each data stream comprises: carrying out run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding carries out run-length coding only on the zero characters in the neural network data; and carrying out normalized Huffman coding on the run-length compressed data and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
step 3, collecting the compression result of each data block as the compression result of the neural network data;
wherein, the step 2 comprises: setting input buffers with the same number as the data streams for buffering the data blocks, wherein each data stream is provided with an output buffer for buffering the compression results of the data blocks; the data stream randomly selects an input buffer or selects the input buffer according to a preset rule, and obtains the data block from the input buffer.
The neural network quantization compression method based on the inter-stream data shuffling is characterized in that each data stream is provided with an independent input buffer and an independent output buffer, the input buffer is used for buffering the data block, and the output buffer is used for buffering the compression result of the data block.
The neural network quantization compression method based on the inter-stream data shuffling, wherein the step 2 comprises:
step 21, monitoring the amount of data already compressed and encoded by each data stream, and judging whether P_current - P_min ≥ S is satisfied; if yes, writing the virtual code corresponding to the virtual character into the output buffer of the current data stream, otherwise executing step 21 again; wherein P_current is the number of input queue widths already encoded by the current data stream, P_min is the number of input queue widths already encoded by the data stream with the slowest coding rate, and S is the pipeline depth of the data stream.
The neural network quantization compression method based on the inter-stream data shuffling, wherein the data volume of the virtual codes written into the output buffer of the current data stream is P_current - P_min.
The neural network quantization compression method based on the inter-stream data shuffling, wherein the step 21 comprises the following steps: writing the virtual character into an output cache of the current data stream to obtain third intermediate data; and judging whether the character in the third intermediate data, which is the same as the virtual character, is the original character in the output cache of the current data stream, if so, replacing the character in the third intermediate data, which is the same as the virtual character, with the virtual code, and meanwhile, increasing a flag bit indicating that the character is the original character, otherwise, replacing the character in the third intermediate data, which is the same as the virtual character, with the virtual code, and meanwhile, increasing a flag bit indicating that the character is the replaced character.
The invention also provides a neural network quantization compression system based on the inter-stream data shuffling, which comprises the following components:
the module 1 is used for acquiring neural network data to be compressed after quantization processing, and blocking the neural network data to obtain a plurality of data blocks;
a module 2, configured to allocate a data stream to each of the data blocks for compression, wherein the compression performed by each data stream comprises: carrying out run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding carries out run-length coding only on the zero characters in the neural network data; and carrying out normalized Huffman coding on the run-length compressed data and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
a module 3, configured to aggregate compression results of the data blocks, as compression results of the neural network data;
wherein, this module 2 includes: setting input buffers with the same number as the data streams for buffering the data blocks, wherein each data stream is provided with an output buffer for buffering the compression results of the data blocks; the data stream randomly selects an input buffer or selects the input buffer according to a preset rule, and obtains the data block from the input buffer.
The neural network quantization compression system based on the inter-stream data shuffling is characterized in that each data stream is provided with an independent input buffer and an independent output buffer, the input buffer is used for buffering the data block, and the output buffer is used for buffering the compression result of the data block.
The neural network quantization compression system based on the inter-stream data shuffling, wherein the module 2 comprises:
a module 21, configured to monitor the amount of data already compressed and encoded by each data stream, and to judge whether P_current - P_min ≥ S is satisfied; if yes, the virtual code corresponding to the virtual character is written into the output buffer of the current data stream, otherwise the module 21 is executed again; wherein P_current is the number of input queue widths already encoded by the current data stream, P_min is the number of input queue widths already encoded by the data stream with the slowest coding rate, and S is the pipeline depth of the data stream.
The neural network quantization compression system based on the inter-stream data shuffling, wherein the data volume of the virtual codes written into the output buffer of the current data stream is P_current - P_min.
The neural network quantization compression system based on the inter-stream data shuffling, wherein the module 21 comprises: writing the virtual character into an output cache of the current data stream to obtain third intermediate data; and judging whether the character in the third intermediate data, which is the same as the virtual character, is the original character in the output cache of the current data stream, if so, replacing the character in the third intermediate data, which is the same as the virtual character, with the virtual code, and meanwhile, increasing a flag bit indicating that the character is the original character, otherwise, replacing the character in the third intermediate data, which is the same as the virtual character, with the virtual code, and meanwhile, increasing a flag bit indicating that the character is the replaced character.
According to the scheme, the invention has the advantages that:
1. aiming at the characteristic that the quantized neural network data has sparsity, the run-length coding is improved, run-length all-zero coding is provided, and the neural network data can be efficiently and losslessly compressed;
2. the run-length all-zero coding of the invention comprises second-order character replacement, further improves the compression efficiency, reduces the number of 0 in data, and reserves more compression space for the subsequent Huffman coding;
3. the Huffman tree is reformed from top to bottom, eliminating the need to store a complete Huffman tree structure and significantly reducing the complexity of table look-up operations;
4. virtual codes corresponding to the virtual characters are written into the output buffer of the data stream whose coding runs ahead, so that the compression speed among streams is balanced, the coding difference among pipelines is reduced, and deadlock is avoided.
5. The data stream randomly selects the input buffer or selects the input buffer according to the preset rule, so that the continuous reading of data from the same input buffer is avoided, the input randomness of a single data stream is increased, the coding rate of each data stream is further balanced, and the utilization efficiency of hardware resources is improved.
Drawings
FIG. 1 is a diagram of network pruning and parameter recovery;
FIG. 2 is a diagram of network parameter progressive quantization;
FIG. 3 is a parameter sharing and cluster center trimming diagram;
FIG. 4 is a diagram of quantized Conv5 level weight distribution;
FIG. 5 is a Conv5 level weight distribution graph after quantization and pruning;
FIG. 6 is a Conv5 level weight distribution graph after retraining;
FIG. 7 is a diagram of the first stage encoding results;
FIG. 8 is a diagram of second stage encoding rules;
FIG. 9 is a diagram of the second stage encoding results;
FIGS. 10a and 10b are exemplary diagrams of normalized Huffman trees;
FIG. 11 is a diagram of data splitting in parallel compression;
FIG. 12 is a diagram of multiple data stream codecs;
FIG. 13 is a diagram of a Last bit;
FIG. 14 is a diagram of a Head-Data storage format;
FIG. 15 is a schematic diagram of the generation of a deadlock;
FIG. 16 is a FakeLiteral replacement rule diagram;
FIG. 17 is a flow chart of the present invention.
Detailed Description
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Aiming at optimizing the neural network compression method, the present application analyzes the distribution characteristics of the data in a neural network model after pruning and quantization, proposes a lossless compression algorithm combining entropy coding, run-length coding and all-zero coding, and fully explores its deployment form on hardware. Finally, an NNcodec neural network codec simulator is designed and implemented, and longitudinal and transverse comparison experiments are performed on 7 relatively mature neural network models, which proves at the software level the optimization effect of the hybrid coding on neural network compression and also provides an easily realized hardware design scheme.
Run-length encoding replaces a run of repeated characters with a combination of the character plus the run length; the number of consecutive occurrences of the same character is called the run. For example, the character sequence (AABBCCCD) is compressed into (A2B2C3D1) after run-length encoding.
Run-length coding is an effective data compression method, but a reasonable run bit width must be designed in a concrete implementation: a run that is too short causes repeated compression, while a run that is too long reduces the compression ratio. Furthermore, it can be observed in principle that this method cannot be applied to data that has already been run-length encoded.
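For reference, a minimal Python sketch of the classic run-length coding described above (each character followed by its run count; the function name is illustrative) reproduces the AABBCCCD example:

```python
from itertools import groupby

def run_length_encode(s: str) -> str:
    """Classic run-length coding: each character is followed by its run count."""
    return "".join(f"{ch}{len(list(group))}" for ch, group in groupby(s))

print(run_length_encode("AABBCCCD"))  # -> "A2B2C3D1"
```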
A reasonable run bit width should first be chosen for run-length all-zero encoding. Considering that current neural network compression models mostly use fixed-point quantization of 8-bit or lower bit width, the invention adopts a shorter 2-bit run, i.e. at most 4 consecutive 0s can be compressed into the digit 0 plus a run of 3. Taking the sequence {1,0,0,0,0,2,0,255,0,0} to be encoded in 8-bit fixed-point format as an example, the encoding result of the first stage is shown in fig. 7.
The total bit width of the data to be coded is 80 bits and the total code length after the first stage of compression is 64 bits, whereas the total code length obtained with traditional run-length coding would be 72 bits. Considering that long runs of 0s appear in compressed neural network models, attaching no run field to non-zero data can be regarded as an improvement of run-length coding for neural network computation.
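The following Python sketch illustrates the first stage of run-length all-zero coding at the token level (zeros carry a 2-bit run counting the additional zeros, non-zero values carry no run); it is an interpretation of the scheme described above and does not reproduce the exact bit layout of fig. 7:

```python
def run_all_zero_encode(values, run_bits=2):
    """First-stage run-length all-zero coding: only zeros carry a run field.

    Returns (value, run) tokens; run is None for non-zero values, otherwise the
    number of additional zeros that follow (0 .. 2**run_bits - 1).
    """
    max_run = (1 << run_bits) - 1
    tokens, i = [], 0
    while i < len(values):
        if values[i] != 0:
            tokens.append((values[i], None))
            i += 1
        else:
            run = 0
            while run < max_run and i + run + 1 < len(values) and values[i + run + 1] == 0:
                run += 1
            tokens.append((0, run))
            i += run + 1
    return tokens

print(run_all_zero_encode([1, 0, 0, 0, 0, 2, 0, 255, 0, 0]))
# [(1, None), (0, 3), (2, None), (0, 0), (255, None), (0, 1)]
```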
Special character replacement and the Extra bit. In order to further improve the compression ratio of all-zero data without introducing excessive overhead, a character that appears with extremely low probability in the original data (for example, 255) is selected as zeroLiteral in the second stage to replace coded segments with a run of 3; to distinguish it from characters numerically equal to zeroLiteral, such characters are in turn replaced with another character of very low occurrence probability (e.g. 254), called zeroExtra, to which a 1-bit Extra bit is appended to distinguish it from characters numerically equal to zeroExtra, as shown in fig. 8.
Note that the method of appending 1 bit directly after zeroLiteral is not chosen here; instead the above second-order character substitution is adopted, because the frequencies of the characters equal in value to zeroLiteral and zeroExtra are both very low, and performing one more character substitution and then distinguishing by the Extra bit is more reasonable than appending 1 bit to the codes corresponding to every 4 consecutive 0s. The result of the second stage is shown in fig. 9.
This completes one pass of run-length all-zero encoding; the decoding flow in the general case can be summarized as follows:
(1) Restore special characters: if zeroLiteral is found in the compressed data, replace it with the digit 0 and a run of 3; if zeroExtra is found, check the following Extra bit: if it is 0, replace with zeroLiteral, otherwise replace with zeroExtra.
(2) Run-length decoding: perform run-length decoding on the 0s in the data restored in step (1), expand the following 2-bit run, and write the corresponding number of 0s into the result.
The above describes the run-length all-zero coding algorithm; the 80-bit data to be coded in the example is finally compressed into 49 bits. In addition, consecutive 0s can be directly replaced with zeroLiteral at a compression ratio of 4, which leaves more compression space for the subsequent Huffman coding at the positions where 0 appears in the data.
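Continuing the sketch above, the second stage (special character replacement with the example values zeroLiteral = 255 and zeroExtra = 254) can be outlined at the token level as follows; the exact bit packing of figs. 8-9 is not reproduced, and the symbol names are illustrative:

```python
ZERO_LITERAL = 255   # example value from the text: stands for "digit 0 with run 3"
ZERO_EXTRA   = 254   # second special character used for disambiguation

def second_stage(tokens):
    """Replace (0, run=3) segments by zeroLiteral and disambiguate real 255/254.

    Output symbols are plain values, (0, run) pairs for short zero runs, or
    (ZERO_EXTRA, extra_bit) pairs: extra_bit 0 restores zeroLiteral (a real 255),
    extra_bit 1 restores zeroExtra (a real 254), matching the decoding rule above.
    """
    out = []
    for value, run in tokens:
        if run == 3:                        # four consecutive zeros
            out.append(ZERO_LITERAL)
        elif run is not None:               # shorter zero run: keep digit 0 + run
            out.append((0, run))
        elif value == ZERO_LITERAL:         # a genuine 255
            out.append((ZERO_EXTRA, 0))
        elif value == ZERO_EXTRA:           # a genuine 254
            out.append((ZERO_EXTRA, 1))
        else:
            out.append(value)
    return out

print(second_stage([(1, None), (0, 3), (2, None), (0, 0), (255, None), (0, 1)]))
# [1, 255, 2, (0, 0), (254, 0), (0, 1)]
```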
Normalized Huffman coding. The Huffman code corresponding to a given data distribution is not unique, because nodes with equal frequency may appear while the Huffman tree is being constructed. To apply Huffman coding better, a fixed and efficient method for constructing the Huffman tree is needed; the unique Huffman code generated by such a method is called a normalized Huffman code.
Huffman tree reformation. Specifically, the invention adopts HSF coding as the normalized Huffman coding. The main idea is to reform the Huffman tree from top to bottom: among nodes at the same level, leaf nodes are preferentially moved to the left of the binary tree. Without changing the frequency distribution, all codewords can then be obtained by adding 1 within a code length and by adding 1 followed by a left shift when the code length increases, so complex binary tree traversal or full table lookup is replaced by comparison and addition, greatly reducing the storage and computation cost required for encoding and decoding. As shown in fig. 10a, fig. 10b and Table 1, given the character sequence {u1, u2, u3, u4, u5} and any two Huffman trees, their normalized forms can be found with this reconstruction method.
Table 1: normalized Huffman coding example (the table is provided as an image in the original publication)
Redefining the code table. It can be seen that each Huffman tree corresponds to a unique normalized form, and the normalized codewords follow a hardware-friendly rule: within the same code length the next codeword is the previous one plus 1, and when the code length increases the codeword is incremented by 1 and then left-shifted by 1 bit. This rule can be used to redefine the code table, thereby completely dispensing with storage of the whole Huffman tree structure and significantly reducing the complexity of table lookup operations.
Taking the coded sequence of example (a) above, the HSF encoding/decoding process requires the following code tables:
(1) CharTable: all characters to be coded, arranged in descending order of occurrence frequency; in example (a) it is {u1, u4, u5, u2, u3}; shared by encoding and decoding.
(2) LenTable: all effective code lengths in ascending order; the HSF code in example (a) has only 2-bit and 3-bit code lengths, so the corresponding LenTable is {2, 3}; shared by encoding and decoding.
(3) RangeTable: for each effective code length, the CharTable index of the character corresponding to the last codeword of that length; in example (a) the last 2-bit and 3-bit codewords correspond to u5 and u3 respectively, so the RangeTable is {2, 4}; used only in the encoding stage.
(4) LimitTable: the numerical value of the last codeword of each effective code length; in example (a) the last 2-bit and 3-bit codewords are 10 and 111 respectively, so the LimitTable is {2, 7}; used only in the decoding stage.
(5) BaseTable: the LimitTable minus the RangeTable; in example (a) the BaseTable is {0, 3}; shared by encoding and decoding.
On this basis, the flow of generating the HSF code of character u4 in example (a) with the above code tables can be divided into the following three steps (a sketch in Python follows these steps):
Lookup: access the CharTable to get the rank of u4, i.e. rank(u4) = 1;
Compare: access the RangeTable to get the index of the first entry greater than or equal to rank(u4); since rank(u4) ≤ 2, index(u4) = 0;
Add: access the BaseTable and the LenTable to obtain the base value base and the code length len corresponding to index; the sum of base and rank is the numerical value of the codeword, and combining it with len gives the HSF code of u4: since base(u4) = 0 and len(u4) = 2, code(u4) = 0 + 1 = 1, so the final encoding result is 01.
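The three encoding steps can be sketched in Python with the example (a) code tables; this is only an illustration of the lookup/compare/add flow, not a hardware-accurate implementation:

```python
CHAR_TABLE  = ["u1", "u4", "u5", "u2", "u3"]   # descending frequency order
LEN_TABLE   = [2, 3]                            # effective code lengths, ascending
RANGE_TABLE = [2, 4]                            # CharTable index of the last codeword per length
BASE_TABLE  = [0, 3]                            # LimitTable - RangeTable

def hsf_encode(char: str) -> tuple[int, int]:
    """Return (codeword value, code length) for one character."""
    rank = CHAR_TABLE.index(char)                                    # Lookup
    index = next(i for i, r in enumerate(RANGE_TABLE) if r >= rank)  # Compare
    return BASE_TABLE[index] + rank, LEN_TABLE[index]                # Add

code, length = hsf_encode("u4")
print(format(code, f"0{length}b"))  # -> "01"
```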
Correspondingly, the decoding flow for parsing the first character u4 from an HSF code stream such as 01xxx can be derived as follows (a corresponding sketch follows these steps):
the Compare accesses the Limit Table and the LenTable, and the index accumulated from 0 is traversed until the limit is more than or equal to the front len bit of the code stream, because the limit value of 0 in the Limit Table is more than or equal to the front 2bit of the code stream, the index is 0, the limit is 2, the len is 2, and the code is 1;
Sub: access the BaseTable and subtract the base value corresponding to index from the code value to obtain the character rank, i.e. rank = 1 - 0 = 1;
Lookup: access the CharTable to obtain the character corresponding to rank; the final decoding result is the entry with rank 1 in the CharTable, i.e. the character u4.
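Correspondingly, the compare/sub/lookup decoding steps can be sketched as follows (tables as in the encoding sketch above, plus the LimitTable; again only an illustration):

```python
CHAR_TABLE  = ["u1", "u4", "u5", "u2", "u3"]
LEN_TABLE   = [2, 3]
BASE_TABLE  = [0, 3]
LIMIT_TABLE = [2, 7]   # numerical value of the last codeword of each code length

def hsf_decode_first(bits: str) -> str:
    """Parse the first character from an HSF code stream such as '01...'."""
    for index, length in enumerate(LEN_TABLE):       # Compare
        code = int(bits[:length], 2)
        if LIMIT_TABLE[index] >= code:
            rank = code - BASE_TABLE[index]          # Sub
            return CHAR_TABLE[rank]                  # Lookup
    raise ValueError("invalid HSF code stream")

print(hsf_decode_first("01101"))  # -> "u4"
```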
The introduction of the normalized Huffman coding algorithm shows that HSF coding can simplify the storage and operation structure of Huffman coding, while the encoding and decoding processes naturally divide into a 3-stage pipeline, providing an efficient and reasonable implementation scheme for hardware deployment.
A method for parallel compression with multiple data streams. Combining run-length all-zero coding and normalized Huffman coding, we already have the algorithmic basis of the hybrid coding proposed by the invention; next we discuss how to improve the compression efficiency of this coding method on hardware.
In the context of parallel computing, most serial operations use pipeline techniques to achieve temporal parallelism. On the other hand, with the development of multi-core architectures, data-level parallelism techniques such as Single Instruction Multiple Data (SIMD) have also come to be used in various data compression methods. Inspired by similar methods, a parallel compression method based on data partitioning and pipelined encoding/decoding is proposed herein, and its hardware implementation is considered and optimized.
Pipeline stage division and data blocking. From pre-estimates of the complexity of each serial operation, the existing compression method can be divided into a 5-stage pipeline in hardware, as shown in Table 2. Because the encoding and decoding operations are symmetric, part of the operation and storage structures can be multiplexed between the compression and decompression stages, improving the utilization of the hardware units.
Table 2: codec pipeline stage division (the table is provided as an image in the original publication)
Based on this pipelined codec structure, the fixed-size batch (Chunk) of data to be compressed needs to be partitioned into Blocks, as shown in fig. 11. Each Block is assigned a data Stream for compression or decompression, where each data Stream comprises the hardware units required by the 5-stage pipeline, and all data Blocks and codec Streams together perform the parallel compression and decompression operation proposed herein.
In order to fully account for the various influencing factors in the subsequent experiments and to evaluate the performance of the parallel compression/decompression method scientifically, the independent variables in the codec structure are determined in advance, and the basic granularities of input and output are named on this basis.
Stream, i.e. the number of data streams. From the viewpoint of processing speed, more data streams are of course better, but this multiplies the area and power overhead of the codec, so the Stream count is a parameter that needs to be balanced through experimental analysis.
Chunk, i.e. the size of the batch of data to be compressed. The most intuitive choice is to treat all data to be compressed as one Chunk, but that prevents higher-level data parallelism obtained by replicating multiple codec structures (for example, because of hardware limitations and data locality, simply adding more Streams does not yield better compression), so the Chunk size is used as one of the controlled variables in the experiments.
Dword, i.e. the input queue width of each Stream, which together with the Stream count determines the Block size; usually a value comparable to the width of the data type to be compressed is selected, so as to avoid a performance bottleneck or a waste of hardware resources in the stage-1 pipeline caused by the serial nature of run-length coding.
Cword, i.e. the output queue width of each Stream; a reasonable value can be estimated by dividing Dword by the expected compression ratio.
Fig. 12 shows a specific codec structure, where Dword is 16 bits, Cword is 4 bits, and there are 64 Streams in total, i.e. a 1024-bit Block can be processed in parallel to obtain 256 bits of compressed data. Note that the Stream inputs and outputs are synchronous while the interior is asynchronous; this guarantees synchronous operation at Block granularity and makes it convenient for subsequent hardware units to store, with a fixed data size, the compressed data generated simultaneously by multiple Streams in each cycle.
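A quick arithmetic check of this geometry (values taken from the figure description above; the variable names are illustrative):

```python
STREAMS = 64     # number of parallel data streams
DWORD   = 16     # input queue width per Stream, in bits
CWORD   = 4      # output queue width per Stream, in bits

block_in_bits  = STREAMS * DWORD   # 1024-bit Block processed in parallel
block_out_bits = STREAMS * CWORD   # 256 bits of compressed data per output step
print(block_in_bits, block_out_bits)  # 1024 256
```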
To solve the aforementioned problem that a Chunk may not be evenly divisible into Blocks, the pipeline stages in each Stream need a 1-bit Last flag in addition to the necessary codec hardware, allowing a subset of Streams to flush the pipeline early and end the compression process under the input/output synchronization constraint; when the Last bits at the output ports of all Streams are 1, compression of the current Block is complete, as shown in fig. 13.
In addition, because a variable-length coding method is adopted, i.e. the code length after compression is not fixed, the codeword of a character with small occurrence probability may exceed its original data bit width, and in extreme cases the compressed data may exceed the Block size, causing compression to fail. Therefore, a Bypass design should also be added around the codec structure, i.e. an alternative operation after compression of the current Block fails.
Specifically, if the compressed data obtained at the output exceeds the size of the current Block, the compression is abandoned, the original data is directly used as the compression result of the current Block, and the pipeline stages are skipped in the decompression stage (a minimal sketch of this decision follows). The next subsection describes the storage format that enables the compression Bypass and the parameterized pipeline.
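A minimal sketch of the Bypass decision described above (the `encode` callable and the byte-level size comparison are assumptions made for illustration):

```python
def compress_block(block: bytes, encode) -> tuple[bool, bytes]:
    """Bypass rule: if the encoded result exceeds the Block size, give up
    compression and keep the original data; decompression then skips the pipeline."""
    encoded = encode(block)
    if len(encoded) > len(block):
        return True, block        # bypassed: raw data used as the "compression result"
    return False, encoded
```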
The Head-Data storage format. In order to implement the Bypass operation and other, more flexible encoding/decoding modes, whether the operation in each pipeline stage is enabled is configured by parameters, so that the effect of each stage of the overall scheme can be conveniently evaluated while improving compression efficiency. To this end a Head-Data storage format for compressed data is proposed: the compressed data is divided into a header (Head) and a data part (Data), where the Head stores the address offset, the data length and the compression configuration (for example, the enable of each stage, the Bypass enable and the like) of the Data part, as shown in fig. 14.
On this basis, instruction-level compression and decompression need an added configuration stage before the codec performs the encoding operation; during decoding, the signals required by each pipeline stage, such as the code tables (CharTable and the like) and the pipeline stage enables, and then signals such as the pipeline stage enable and the Bypass enable, are acquired from the Head part of the compressed Data, so that the correct decoding flow is executed on the Data part and the decompression operation is completed.
In addition, because the address offset is used to locate the Data part, the Head-Data storage format also guarantees correctness when, as in fig. 13, some Streams finish compression early and the output data is padded to 256 bits: the compressed data of each Block has its own exact storage address, so appending some meaningless data at the end of the Data part does not affect the valid execution of the decompression flow.
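A hypothetical field layout for the Head-Data format, purely for illustration (the concrete fields and widths are not specified at this level of detail in the text):

```python
from dataclasses import dataclass

@dataclass
class Head:
    data_offset: int     # address offset locating the Data part
    data_length: int     # length of the Data part
    bypass: bool         # Bypass enable for this Block
    stage_enable: int    # bit mask enabling/disabling each pipeline stage
    # code tables needed by the decoder (e.g. CharTable) would also be carried here

@dataclass
class CompressedBlock:
    head: Head
    data: bytes          # compressed Data part, or the raw Block if bypassed
```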
FakeLiteral and pipeline balancing. Because the codec structure has fixed-length input, variable-length output, synchronized ports and an asynchronous interior, an extreme imbalance in compression speed among the Streams can, under the synchronization constraint on the input/output queues, lead to deadlock.
First, it should be made clear that the inputs and outputs of all Streams are performed synchronously: input occurs only when the first pipeline stages of all Streams are empty, output occurs only when the last pipeline stages of all Streams are full, and the intermediate stages execute independently, waiting only when the previous stage is empty or the next stage is full. Without loss of generality, assume an example with 2 Streams, a Dword width of 8, a Cword width of 8 and a FIFO depth of 4; fig. 15 shows how the deadlock arises.
To simulate the extreme case of imbalance between streams, assume that Stream1 encodes each 8-bit datum into 1 bit while Stream2 encodes each 8-bit datum into 16 bits. After the inputs of the first 3 cycles, the pipeline is filled with data and output begins at Cycle 4; at Cycle 5 the 4 entries of Stream2's output FIFO are full, but because Stream1's output FIFO currently holds only 2 bits and cannot provide a complete Cword, Stream2 blocks from this moment and none of its pipeline stages can accept input. After Cycle 8 all pipeline stages in Stream1 are drained, but the 5 bits in its output FIFO are still insufficient to produce an output, and its input is blocked by the first pipeline stage of Stream2, so no further input is possible and the system enters a deadlock.
By observation we find that, by the time the deadlock arises, Stream1 has completed the encoding of 5 Dwords while Stream2 has completed only 2, and the differing Dwords are buffered as intermediate data in the pipeline stages of Stream2, which blocks the whole pipeline. Guided by this conclusion, we can insert empty bubbles into the Stream with the fastest coding speed at appropriate times and narrow the gap between pipelines, thereby avoiding deadlock.
To address this problem, the concept of FakeLiteral, i.e. a "dummy character", is introduced: it is input as a place-holding bubble during encoding and is directly discarded during decoding rather than treated as valid data. Similar to the special character zeroLiteral introduced in section 3.2, FakeLiteral is also represented by a character with a very low frequency of occurrence in the data to be encoded, and the same second-order replacement and Extra-bit strategy is used to distinguish it from real characters, as shown in fig. 16 (FakeLiteral and FakeExtra are taken as 253 and 252, respectively); the identification and discarding of FakeLiteral in the decoding stage can also be completed in the Alter stage.
From the derivation just given, the condition under which deadlock occurs is as follows, where P_current is the number of Dwords already encoded by the current Stream, P_min is the number of Dwords already encoded by the Stream with the slowest coding rate, and S is the pipeline depth.
P_current - P_min ≥ S
For the example just given, at Cycle 8 we have P1 = 5, P2 = 2 and S = 3, so the condition for writing FakeCode into the output FIFO is met; the deadlock problem is thus solved by the improvement of FakeLiteral insertion.
From the above analysis it can be seen that the P value of each Stream actually corresponds to its number of inputs. Meanwhile, when the deadlock condition is met, several FakeCodes corresponding to FakeLiteral are written into the output FIFO of the current Stream so that a complete Cword output can be generated. To reduce the performance loss caused by deadlock handling, FakeCode can be generated in advance and stored, as configuration information of the codec, in the output FIFO of each Stream, so that repeated encoding of FakeLiteral is avoided.
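The insertion rule can be summarized in a short sketch (a simplified software view of the hardware mechanism; S = 3 is taken from the worked example, and FakeCode is assumed to be pre-generated as described above):

```python
PIPELINE_DEPTH = 3   # S in the deadlock condition above

def maybe_insert_fake(p_counts, current, fake_code, output_fifos):
    """Pad the current Stream's output FIFO with pre-generated FakeCode when
    P_current - P_min >= S, so that a complete Cword can be emitted."""
    p_current = p_counts[current]    # Dwords already encoded by this Stream
    p_min = min(p_counts)            # slowest Stream
    gap = p_current - p_min
    if gap >= PIPELINE_DEPTH:
        output_fifos[current].extend([fake_code] * gap)   # amount: P_current - P_min
        return True
    return False
```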
The Shuffle method. From the discussion of the pipeline balancing method in the previous section, we know that the input data of a single Stream may contain many consecutive high-frequency characters, so the coding rates of different Streams may differ too much, causing a large amount of blocking in the input and output FIFOs under the synchronization constraint. Although the introduction of FakeLiteral solves the deadlock problem, it does not remove the performance waste caused by pipeline blocking.
To solve this problem, a method of grouping the Streams and shuffling the input data between them is proposed: two successive inputs may thus come from two Streams different from the original one, which increases the input randomness of a single Stream, balances the coding rate of each Stream, and improves the utilization efficiency of hardware resources.
Qualitative analysis experiments show that, when the number of Streams is an integer multiple of 16, the following 4 Shuffle modes perform best (an illustrative mapping is sketched after this list):
adj4, dividing adjacent 4 streams into a group, circularly shuffling in the group, inputting Dword sequentially input by streams 1 into streams 1, 2,3, 4 before shuffling, and so on.
Adj16: adjacent groups of 16 Streams are formed and shuffled cyclically within each group; the Dwords that would be input sequentially to Stream1 before shuffling are fed to Streams 1, 2, 3, ..., 16 respectively, and so on.
Skip4: Streams are grouped with a stride of 4 and shuffled cyclically within each group; the Dwords that would be input sequentially to Stream1 before shuffling are fed to Streams 1, 5, 9 and 13 respectively, and so on.
Skip16: Streams are grouped with a stride of 16 and shuffled cyclically within each group; the Dwords that would be input sequentially to Stream1 before shuffling are fed to Streams 1, 17, 33 and 49 respectively, and so on.
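One possible reading of these four modes as an input-to-Stream mapping is sketched below (0-based Stream ids; the grouping formula is an interpretation, not taken verbatim from the text):

```python
def group_members(stream_id: int, group_size: int, stride: int) -> list[int]:
    """Streams in the shuffle group of `stream_id`: Adj4/Adj16 use stride=1 with
    group_size 4/16; the stride modes use group_size=4 with stride 4/16."""
    span = group_size * stride
    base = (stream_id // span) * span + stream_id % stride
    return [base + k * stride for k in range(group_size)]

def shuffle_target(stream_id: int, step: int, group_size: int, stride: int) -> int:
    """The step-th Dword originally destined for `stream_id` is redirected
    cyclically to the next member of its group."""
    members = group_members(stream_id, group_size, stride)
    return members[(members.index(stream_id) + step) % group_size]

# Stream 0 (Stream1 in the 1-based text):
print([shuffle_target(0, t, 4, 1) for t in range(4)])   # Adj4     -> [0, 1, 2, 3]
print([shuffle_target(0, t, 4, 4) for t in range(4)])   # stride-4 -> [0, 4, 8, 12]
```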
This completes the introduction of the algorithmic scheme of the lossless compression method. Reviewing the evaluation indexes of neural network compression listed in section 2.5, we can draw the following conclusion: the hybrid coding proposed here is a lossless compression algorithm that can restore any encoded value in the neural network and in principle avoids any reduction of the Accuracy of the neural network model; for original data with fixed data bit width and size, the BitRate is closely related to the overall compression effect, as shown in the following formula.
BitRate = DataWidth / CompressionRatio
Therefore, the experiments and analyses in the next chapter are discussed mainly around the CompressionRatio, and the compression effect of the hybrid coding is quantified in graph form, so as to prove its advantages for neural network computation and its optimization effect on neural network compression.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a neural network quantization compression system based on the inter-stream data shuffling, which comprises the following components:
the module 1 is used for acquiring neural network data to be compressed after quantization processing, and blocking the neural network data to obtain a plurality of data blocks;
a module 2, configured to allocate a data stream to each of the data blocks for compression, wherein the compression performed by each data stream comprises: carrying out run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding carries out run-length coding only on the zero characters in the neural network data; and carrying out normalized Huffman coding on the run-length compressed data and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
a module 3, configured to aggregate compression results of the data blocks, as compression results of the neural network data;
wherein, this module 2 includes: setting input buffers with the same number as the data streams for buffering the data blocks, wherein each data stream is provided with an output buffer for buffering the compression results of the data blocks; the data stream randomly selects an input buffer or selects the input buffer according to a preset rule, and obtains the data block from the input buffer.
The neural network quantization compression system based on inter-stream data shuffling is characterized in that each data stream is provided with an independent input buffer and an independent output buffer, the input buffer is used for buffering the data block, and the output buffer is used for buffering the compression result of the data block.
The neural network quantization compression system based on the inter-stream data shuffling, wherein the module 2 comprises:
a module 21, configured to monitor the amount of data already compressed and encoded by each data stream, and to judge whether P_current - P_min ≥ S is satisfied; if yes, the virtual code corresponding to the virtual character is written into the output buffer of the current data stream, otherwise the module 21 is executed again; wherein P_current is the number of input queue widths already encoded by the current data stream, P_min is the number of input queue widths already encoded by the data stream with the slowest coding rate, and S is the pipeline depth of the data stream.
The neural network quantization compression system based on the inter-stream data shuffling, wherein the data volume of the virtual codes written into the output buffer of the current data stream is P_current - P_min.
The neural network quantization compression system based on the inter-stream data shuffling, wherein the module 21 comprises: writing the virtual character into an output cache of the current data stream to obtain third intermediate data; and judging whether the character in the third intermediate data, which is the same as the virtual character, is the original character in the output cache of the current data stream, if so, replacing the character in the third intermediate data, which is the same as the virtual character, with the virtual code, and meanwhile, increasing a flag bit indicating that the character is the original character, otherwise, replacing the character in the third intermediate data, which is the same as the virtual character, with the virtual code, and meanwhile, increasing a flag bit indicating that the character is the replaced character.
As shown in fig. 17, the present invention further provides a hybrid coding based neural network compression system, which includes:
the method comprises the following steps that a module 1 acquires neural network data to be compressed after quantization processing, and blocks the neural network data to obtain a plurality of data blocks;
the module 2 allocates a data stream to each data block for compression, and the compression performed by each data stream comprises: carrying out run-length all-zero coding on the data block to obtain run-length compressed data, wherein the run-length all-zero coding carries out run-length coding only on the zero characters in the neural network data; and carrying out normalized Huffman coding on the run-length compressed data and reforming the coding result to obtain the normalized Huffman code as the compression result of the data block;
and a module 3 for collecting the compression result of each data block as the compression result of the neural network data.
The neural network compression system based on hybrid coding, wherein the inputs and outputs of all data streams in the module 2 are performed synchronously.
The neural network compression system based on hybrid coding, wherein each data stream is provided with an independent input buffer and an independent output buffer, the input buffer buffering the data block and the output buffer buffering the compression result of the data block.
The neural network compression system based on hybrid coding, wherein the module 2 comprises:
a module 21, which monitors the amount of data each data stream has compressed and encoded and judges whether P_current - P_min ≥ S is satisfied; if so, the virtual code corresponding to the virtual character is written into the output buffer of the current data stream, otherwise the module 21 is executed again; where P_current is the input-queue width already encoded by the current data stream, P_min is the input-queue width already encoded by the data stream with the slowest encoding rate, and S is the pipeline depth of the data streams.
The neural network compression system based on hybrid coding, wherein the amount of virtual-code data written into the output buffer of the current data stream is P_current - P_min.
The neural network compression system based on hybrid coding, wherein the module 21 comprises: writing the virtual character into the output buffer of the current data stream to obtain third intermediate data; and judging whether each character in the third intermediate data that equals the virtual character is an original character of the output buffer of the current data stream; if so, replacing that character with the virtual code and adding a flag bit indicating an original character; otherwise, replacing that character with the virtual code and adding a flag bit indicating a replaced character.
The neural network compression system based on hybrid coding, wherein the module 2 comprises: setting input buffers equal in number to the data streams for buffering the data blocks, each data stream being provided with an output buffer for buffering the compression results of its data blocks.
The neural network compression system based on hybrid coding, wherein the module 2 comprises: the data stream randomly selects an input buffer from which to retrieve the data block.
The neural network compression system based on hybrid coding, wherein the module 2 comprises: the data stream selects an input buffer according to a preset rule and acquires the data block from the input buffer.
The neural network compression system based on hybrid coding, wherein the run bit width of the run-length all-zero coding is 2 bits.
The neural network compression system based on hybrid coding, wherein the run-length all-zero coding further comprises: carrying out run-length coding on the zero data in the neural network data to obtain first intermediate data; replacing the coding segments of the first intermediate data whose run length is 3 with the zeroLiteral character to obtain second intermediate data; and judging whether each character in the second intermediate data that equals the zeroLiteral character is an original character of the neural network data; if so, replacing that character with the zeroExtra character and adding a flag bit indicating an original character; otherwise, replacing that character with the zeroExtra character and adding a flag bit indicating a replaced character.
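One possible reading of that pass is sketched below, assuming byte symbols, a 2-bit run field (so zero runs are emitted in chunks of at most 3), and illustrative zeroLiteral and zeroExtra values standing in for the block's two least frequent characters; none of these constants come from the patent. Runs of exactly 3 zeros collapse to a single zeroLiteral symbol, and any real occurrence of zeroLiteral in the data is escaped with zeroExtra plus a flag so a decoder can tell the two apart.

```python
# Hedged sketch of a run-length all-zero encoder: only zero characters are
# run-length coded, nonzero symbols pass through unchanged.
ZERO_LITERAL = 0xFD   # assumption: a rarely used character
ZERO_EXTRA = 0xFC     # assumption: another rarely used character

def run_all_zero_encode(data):
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 0
            while i < len(data) and data[i] == 0 and run < 3:  # 2-bit run field
                run += 1
                i += 1
            if run == 3:
                out.append((ZERO_LITERAL, None))      # full run -> zeroLiteral
            else:
                out.append((0, run))                  # short run: (zero, count)
        elif data[i] == ZERO_LITERAL:
            out.append((ZERO_EXTRA, 1))               # flag 1: original character
            i += 1
        else:
            out.append((data[i], None))               # nonzero symbols pass through
            i += 1
    return out

if __name__ == "__main__":
    sample = [5, 0, 0, 0, 0, 0, 7, ZERO_LITERAL, 0]
    print(run_all_zero_encode(sample))
```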
The neural network compression system based on hybrid coding, wherein the zeroLiteral character and the zeroExtra character are respectively the two characters with the lowest occurrence frequency in the neural network data.
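A small sketch of how those two escape symbols could be chosen, assuming byte-valued data; the function name is illustrative. Picking the two least frequent symbols keeps the escape overhead low, since they rarely need the extra flag bit.

```python
# Choose zeroLiteral and zeroExtra as the two least frequent byte values.
from collections import Counter

def pick_escape_symbols(data):
    counts = Counter(data)
    # Symbols absent from the block count as frequency 0 and are preferred.
    ranked = sorted(range(256), key=lambda s: counts.get(s, 0))
    return ranked[0], ranked[1]

if __name__ == "__main__":
    print(pick_escape_symbols(bytes([1, 1, 2, 3, 3, 3])))
```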
The neural network compression system based on hybrid coding, wherein the coding result is a Huffman tree.
The neural network compression system based on hybrid coding, wherein the module 2 comprises:
reforming the Huffman tree from top to bottom, where the reforming moves the leaf nodes within each level of the Huffman tree to the left side of the binary tree.
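Pushing the leaves to the left within each level is, in effect, canonical (standardized) Huffman coding: only the code lengths then need to be stored, and the codes can be reassigned deterministically. The sketch below shows the standard canonical code assignment from a {symbol: code_length} map, offered as an interpretation of the reform rather than as the patent's exact procedure.

```python
# Canonical Huffman code assignment from code lengths produced by an ordinary
# Huffman tree; symbols at the same depth receive consecutive codes.
def canonical_codes(code_lengths):
    # Sort by (length, symbol) so leaves at the same depth sit "left to right".
    ordered = sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0]))
    codes, code, prev_len = {}, 0, 0
    for symbol, length in ordered:
        code <<= (length - prev_len)   # descend to the deeper level if needed
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

if __name__ == "__main__":
    # Lengths as they might come out of an ordinary Huffman tree.
    print(canonical_codes({"a": 1, "b": 3, "c": 3, "d": 2}))
```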
The neural network compression system based on hybrid coding, wherein the neural network data are the weight data and the neuron input data of a neural network operation.
The neural network compression system based on hybrid coding, wherein, if the standardized Huffman code is larger than or equal to the neural network data in size, the standardized Huffman code is abandoned and the neural network data is used directly as the compression result.
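A small sketch of that fallback, with an assumed one-byte mode flag (0 = stored raw, 1 = Huffman coded) so a decompressor would know which path was taken; the flag convention is an illustration, not specified by the patent.

```python
# Store the block raw whenever the coded form is not strictly smaller.
def choose_output(raw_block: bytes, coded_block: bytes) -> bytes:
    if len(coded_block) >= len(raw_block):
        return b"\x00" + raw_block      # abandon the code, keep the data as-is
    return b"\x01" + coded_block

if __name__ == "__main__":
    print(choose_output(b"abcd", b"toolongcode"))  # falls back to raw storage
    print(choose_output(b"abcdabcd", b"xy"))       # keeps the coded form
```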

Claims (10)

1. A neural network quantization compression method based on inter-stream data shuffling is characterized by comprising the following steps:
step 1, acquiring the neural network data to be compressed after quantization processing, and partitioning the neural network data into a plurality of data blocks;
step 2, allocating a data stream to each data block for compression, where the compression performed by each data stream comprises: carrying out run-length all-zero coding on the data block to obtain run-length compressed data, the run-length all-zero coding applying run-length coding only to the zero characters in the neural network data; carrying out standardized Huffman coding on the run-length compressed data; and reforming the coding result to obtain the standardized Huffman code, which serves as the compression result of the data block;
step 3, aggregating the compression results of the data blocks into the compression result of the neural network data;
wherein the step 2 comprises: setting input buffers equal in number to the data streams for buffering the data blocks, each data stream being provided with an output buffer for buffering the compression result of its data blocks; each data stream selects an input buffer either at random or according to a preset rule and obtains its data block from that input buffer.
2. The method of claim 1, wherein each of the data streams has a separate input buffer for buffering the data block and an output buffer for buffering a compression result of the data block.
3. The neural network quantization compression method based on inter-stream data shuffling of claim 2, wherein step 2 comprises:
step 21, monitoring the amount of data each data stream has compressed and encoded, and judging whether P_current - P_min ≥ S is satisfied; if so, writing the virtual code corresponding to the virtual character into the output buffer of the current data stream, otherwise executing step 21 again; where P_current is the input-queue width already encoded by the current data stream, P_min is the input-queue width already encoded by the data stream with the slowest encoding rate, and S is the pipeline depth of the data streams.
4. The method as claimed in claim 3, wherein the amount of virtual-code data written into the output buffer of the current data stream is P_current - P_min.
5. The inter-stream data shuffling-based neural network quantization compression method as claimed in claim 4, wherein the step 21 comprises: writing the virtual character into the output buffer of the current data stream to obtain third intermediate data; and judging whether each character in the third intermediate data that equals the virtual character is an original character of the output buffer of the current data stream; if so, replacing that character with the virtual code and adding a flag bit indicating an original character; otherwise, replacing that character with the virtual code and adding a flag bit indicating a replaced character.
6. A neural network quantization compression system based on inter-stream data shuffling, comprising:
a module 1, configured to acquire the neural network data to be compressed after quantization processing and to partition the neural network data into a plurality of data blocks;
a module 2, configured to allocate a data stream to each data block for compression, where the compression performed by each data stream comprises: carrying out run-length all-zero coding on the data block to obtain run-length compressed data, the run-length all-zero coding applying run-length coding only to the zero characters in the neural network data; carrying out standardized Huffman coding on the run-length compressed data; and reforming the coding result to obtain the standardized Huffman code, which serves as the compression result of the data block;
a module 3, configured to aggregate the compression results of the data blocks into the compression result of the neural network data;
wherein the module 2 comprises: setting input buffers equal in number to the data streams for buffering the data blocks, each data stream being provided with an output buffer for buffering the compression result of its data blocks; each data stream selects an input buffer either at random or according to a preset rule and obtains its data block from that input buffer.
7. The inter-stream data shuffle based neural network quantization compression system of claim 6, wherein each of said data streams has separate input and output buffers, an input buffer for buffering said data block and an output buffer for buffering the compression result of said data block.
8. The inter-stream data shuffle based neural network quantization compression system of claim 7, wherein module 2 comprises:
a module 21, configured to monitor the amount of data each data stream has compressed and encoded, and to judge whether P_current - P_min ≥ S is satisfied; if so, the virtual code corresponding to the virtual character is written into the output buffer of the current data stream, otherwise the module 21 is executed again; where P_current is the input-queue width already encoded by the current data stream, P_min is the input-queue width already encoded by the data stream with the slowest encoding rate, and S is the pipeline depth of the data streams.
9. The inter-stream data shuffle based neural network quantization compression system of claim 8, wherein the amount of virtual-code data written into the output buffer of the current data stream is P_current - P_min.
10. The inter-stream data shuffle based neural network quantization compression system of claim 9, wherein the module 21 comprises: writing the virtual character into the output buffer of the current data stream to obtain third intermediate data; and judging whether each character in the third intermediate data that equals the virtual character is an original character of the output buffer of the current data stream; if so, replacing that character with the virtual code and adding a flag bit indicating an original character; otherwise, replacing that character with the virtual code and adding a flag bit indicating a replaced character.
CN202011607729.7A 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling Active CN114697673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011607729.7A CN114697673B (en) 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011607729.7A CN114697673B (en) 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling

Publications (2)

Publication Number Publication Date
CN114697673A true CN114697673A (en) 2022-07-01
CN114697673B CN114697673B (en) 2023-06-27

Family

ID=82132891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011607729.7A Active CN114697673B (en) 2020-12-30 2020-12-30 Neural network quantization compression method and system based on inter-stream data shuffling

Country Status (1)

Country Link
CN (1) CN114697673B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859281A (en) * 2019-01-25 2019-06-07 杭州国芯科技股份有限公司 A kind of compaction coding method of sparse neural network
CN111382849A (en) * 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 Data compression method, processor, data compression device and storage medium
CN111384963A (en) * 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 Data compression/decompression device and data decompression method
CN111384962A (en) * 2018-12-28 2020-07-07 上海寒武纪信息科技有限公司 Data compression/decompression device and data compression method
CN111726634A (en) * 2020-07-01 2020-09-29 成都傅立叶电子科技有限公司 High-resolution video image compression transmission method and system based on FPGA

Also Published As

Publication number Publication date
CN114697673B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN114697672B Neural network quantization compression method and system based on run-length all-zero coding
CN114697654B (en) Neural network quantization compression method and system
Goyal et al. Deepzip: Lossless data compression using recurrent neural networks
Lin et al. 1xn pattern for pruning convolutional neural networks
CN109859281B (en) Compression coding method of sparse neural network
KR20100083126A (en) Encoding and/or decoding digital content
CN105306951B (en) The pipeline parallel method accelerating method and system framework of data compression coding
CN110021369B (en) Gene sequencing data compression and decompression method, system and computer readable medium
CN103280221A (en) Audio frequency lossless compression coding and decoding method and system based on basis pursuit
CN110943744B (en) Data compression, decompression and processing method and device based on data compression and decompression
CN112488305B (en) Neural network storage device and configurable management method thereof
EP3311318A1 (en) Method for compressing genomic data
US9137336B1 (en) Data compression techniques
JP5584203B2 (en) How to process numeric data
CN111028897B (en) Hadoop-based distributed parallel computing method for genome index construction
Sun et al. ZFP-V: Hardware-optimized lossy floating point compression
CN114697655B (en) Neural network quantization compression method and system for equalizing compression speed between streams
CN114219027A (en) Lightweight time series prediction method based on discrete wavelet transform
CN114697673B (en) Neural network quantization compression method and system based on inter-stream data shuffling
CN116318172A (en) Design simulation software data self-adaptive compression method
CN113902097A (en) Run-length coding accelerator and method for sparse CNN neural network model
CN114501011A (en) Image compression method, image decompression method and device
JP2011522497A (en) How to count vectors in a regular point network
CN113761834A (en) Method, device and storage medium for acquiring word vector of natural language processing model
KR102572429B1 (en) Method, apparatus and storage for storing a program for multi-demensional matrix multiplication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant