CN108805286A - High performance network accelerated method based on high-order residual quantization - Google Patents

High performance network accelerated method based on high-order residual quantization Download PDF

Info

Publication number
CN108805286A
CN108805286A CN201810604458.6A CN201810604458A CN108805286A CN 108805286 A CN108805286 A CN 108805286A CN 201810604458 A CN201810604458 A CN 201810604458A CN 108805286 A CN108805286 A CN 108805286A
Authority
CN
China
Prior art keywords
tensor
order
quantization
residual
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810604458.6A
Other languages
Chinese (zh)
Inventor
Bingbing Ni (倪冰冰)
Zefan Li (李泽凡)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810604458.6A priority Critical patent/CN108805286A/en
Publication of CN108805286A publication Critical patent/CN108805286A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention provides a high-performance network acceleration method based on high-order residual quantization, comprising: step S1, obtaining a series of binary data of different scales through quantization and recursive operations; step S2, performing convolution operations on the binary data of different scales and combining the resulting operation results. The invention is an effective and accurate deep network acceleration method. The concept of the residual is used to represent information loss, and the residuals of the quantized input data at different scales are computed recursively to reduce that loss. By using binarized weights and binarized inputs, the size of the network is reduced to about 1/32 of the original, and the training speed is improved by about 30 times. The proposed method also makes it possible to train deep convolutional networks on a CPU. Experimental results show that the proposed HORQ network achieves good classification and acceleration performance.

Description

High-performance network acceleration method based on high-order residual quantization
Technical Field
The invention relates to a deep network acceleration method, in particular to a high-performance network acceleration method based on high-order residual quantization.
Background
Binarizing the input tensor has proven to be an effective network acceleration technique, but the existing binarization methods amount to a simple thresholding operation on pixel values (a first-order approximation) and incur a large loss of precision. Methods for accelerating deep network training can be roughly divided into three categories. The simplest approach is network pruning, followed by retraining of the pruned network structure. To achieve higher compression rates, researchers later developed structured sparse approximation techniques that replace larger sub-networks with shallower ones; however, for networks with different structures, this approach requires designing a corresponding, suitable approximation structure for each of them. More recently, the academic community has proposed network binarization schemes that convert the network weights and the corresponding forward and backward data flows into binarized representations, reducing both the amount of computation and the network storage space. Some of these methods also binarize the input image data by a thresholding operation; although this improves training speed, the accuracy of the classification network drops sharply. Such earlier binarization of the input data, using only positive and negative thresholds, is a very coarse quantization of floating-point data and can be regarded as a first-order binary approximation.
In summary, none of the existing network acceleration methods improves network speed both effectively and with high performance, no description or report of a technology similar to the present invention has been found, and no similar data has been collected at home or abroad.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a high-performance network acceleration method based on high-order residual quantization. It is a high-order binarization scheme, specifically a new binarization quantization mode called the high-order residual quantization (HORQ) method, which binarizes both the inputs and the weights, so that the speedup brought by binarization is obtained together with a more accurate approximation of the computation. In particular, the proposed scheme recursively performs residual quantization and produces a series of binarized input images of decreasing magnitude. The invention also provides the corresponding high-order binary filtering and gradient propagation operations for the forward and backward computations.
The invention is realized by the following technical scheme.
A high-performance network acceleration method based on high-order residual quantization comprises the following steps:
step S1, obtaining a series of binary input data with different scales through quantization and recursive operation;
step S2, carrying out convolution operation on binary input data with different scales, and combining the obtained operation results;
and step S3, training the convolutional neural network by using the result obtained in the step S2, and further completing the accelerated training of the convolutional neural network.
Preferably, the step S1 includes the following sub-steps:
step S11, calculating a first order residual error, and further approximating the first order residual error through thresholding operation;
in step S12, step S11 is recursively executed to obtain a series of binarized residual tensors corresponding to different quantization scales as binary input data.
Preferably, the step S11 includes the following sub-steps:
step S111, assuming an input data tensor matrix X, quantizing it by the following process to obtain a first-order residual of the input data tensor matrix X:
X ≈ β1H1
wherein β1 is a real number; H1 ∈ {+1, -1}n represents a first-order binary residual tensor; n is the dimension of tensor H1;
step S112, optimizing the quantization result in step S111 by minimizing
J(β1, H1) = ||X − β1H1||²
wherein J(·) represents a squared-error loss function; the obtained optimization result is the first-order binarization quantization result:
H1* = sign(X), β1* = ||X||l1 / n
wherein ||·||l1 is the l1 norm; the values β1 = β1*, H1 = H1* are used here;
step S113, calculating the difference between the actual input data X and the first-order binary quantization result β1H1 to define a first-order binarized residual tensor R1(X), which is then used for further approximation:
R1(X) = X − β1H1
R1(X) is used to indicate the information loss due to the approximation.
Preferably, step S12 includes the following sub-steps:
step S121, further quantizing R1(X), represented as follows:
R1(X) ≈ β2H2
wherein β2 is a real number; H2 ∈ {+1, -1}n represents a second-order binary residual tensor, n being the dimension of tensor H2;
obtaining a second-order residual quantization of the input data:
X = β1H1 + R1(X) ≈ β1H1 + β2H2
wherein β1 and β2 are respectively real scalars, H1 and H2 are respectively binary residual tensors, β1H1 is called the first-order binary input tensor, and β2H2 is referred to as the second-order binarized input tensor;
step S122, solving the optimization problem of the second-order binarization input tensor:
J(β2, H2) = ||R1(X) − β2H2||²
the obtained optimization results are as follows:
H2* = sign(R1(X)), β2* = ||R1(X)||l1 / n
the values β2 = β2*, H2 = H2* are used here; the second-order binarization residual tensor is further obtained as follows:
R2(X) = R1(X) − β2H2
in the optimization process, the following inequality is obtained:
||R2(X)||² ≤ ||R1(X)||²
where the L2 norm of the second-order binarized residual tensor is used to represent the information loss.
Preferably, the method further comprises the following step:
further quantizing the second-order binarized residual tensor R2(X) and continuing the residual quantization up to order K, to obtain the K-order residual quantization of the input data X:
X ≈ β1H1 + β2H2 + … + βKHK
wherein Hi = sign(Ri−1(X)), βi = ||Ri−1(X)||l1 / n, Ri(X) = Ri−1(X) − βiHi, and R0(X) = X.
preferably, the step S2 includes the following sub-steps:
step S21, reshaping the binary input data;
in step S22, convolution calculation is performed using the reshaped binary input data.
Preferably, the step S21 includes the following sub-steps:
step S211, assuming that the binary input data is an input data tensor matrix X of dimension cin × win × hin, and that the convolution weight tensor matrix W has dimension cout × cin × w × h; W is divided into cout filters, each filter is reshaped to 1 × (cin × w × h), and the entire convolution weight tensor matrix W is reshaped into Wr, the reshaped Wr having dimension cout × (cin × w × h); the output of the convolutional layer is denoted by Y, and the dimension of Y is cout × wout × hout; for the input data tensor matrix X there are in total hout × wout sub-tensors, each of which is reshaped into the same dimension as a filter, so that the reshaped Xr has dimension (cin × w × h) × (wout × hout);
wherein cin represents the number of channels of the input tensor, win represents the width of each channel of the input tensor, hin represents the height of each channel of the input tensor, cout represents the number of convolution kernels, w represents the width of the convolution kernel, h represents the height of the convolution kernel, wout represents the width of each channel of the output tensor, and hout represents the height of each channel of the output tensor;
step S212, using the matrix multiplication Yr = WrXr to represent the calculation result of the convolution operation;
step S213, reshaping Yr back into Y to complete the entire reshaping calculation process.
Preferably, the step S22 includes the following sub-steps:
step S221, quantizing Wr:
Wr(i) ≈ αiBi
wherein Wr(i) is row i of Wr, αi represents a real constant, and Bi represents a first-order binary approximation of Wr(i);
step S222, quantizing Xr:
Xr(i) ≈ β1(i)H1(i)
wherein Xr(i) is row i of Xr;
step S223, carrying out the binarization convolution calculation using the quantized Wr and Xr.
The above-described binarization convolution calculation problem can be solved by using the algorithm in fig. 2.
The invention provides a high-performance network acceleration method based on high-order residual quantization; it is a binary quantization method built on a recursive thresholding operation and provides a high-order residual quantization framework. The invention can therefore obtain a series of binarized images corresponding to different quantization scales. Based on these binarized input tensors (stacked binarized maps of different scales), the invention develops an efficient binarized filtering operation for the forward and backward computations.
Compared with the prior art, the invention has the following beneficial effects:
the invention can accelerate the training speed of the network and reduce the information loss in the quantization process as little as possible; the method of the invention uses the input data of the binaryzation and the weight of the binaryzation at the same time, and after one-time quantization is finished, the next quantization operation is carried out recursively by using the calculated residual error. The method has strong flexibility and can adapt to different experimental conditions.
The invention can improve the running speed and ensure the training effect of the network. And the binary input and the weight are used, so that the network is accelerated, and the information loss is reduced as much as possible.
The present invention utilizes the concept of recursion to fully utilize the tensors generated by the recursion process. Combining the tensors generated by recursion, the quantization results of different scales are used.
The method can adapt to different hardware requirements and accelerate the selection of a proper residual error order. When the residual error order is two or three, the network can be accelerated, and the information loss can be reduced as much as possible.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a tensor reshaping process in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of an algorithm for second order binary convolution according to an embodiment of the present invention;
FIG. 3 is a flow chart of the training of the HORQ network according to an embodiment of the present invention;
FIG. 4 is a graph of experimental comparison results on a CIFAR-10 data set according to an embodiment of the present invention;
FIG. 5 is a graph of experimental comparison results on the MNIST data set;
FIG. 6 is a graph of acceleration rate versus number of channels in accordance with an embodiment of the present invention;
FIG. 7 is a graph of acceleration rate versus quantization order for one embodiment of the present invention;
FIG. 8 is a graph of acceleration rate versus convolution kernel size for one embodiment of the present invention;
fig. 9 is a schematic diagram of an algorithm of high order residual quantization according to an embodiment of the present invention.
Detailed Description
The following examples are given to illustrate the present invention. They are carried out on the premise of the technical solution of the present invention and give detailed implementation modes and specific procedures, but the scope of the present invention is not limited to these examples.
The invention provides a high-performance network acceleration method based on high-order residual quantization, a high-order binarization scheme that achieves a more accurate approximate calculation while retaining the speedup that binarization provides. In the following embodiments, the proposed scheme recursively performs residual quantization and produces a series of binarized input images of decreasing magnitude. The embodiments also provide the corresponding high-order binary filtering and gradient propagation operations for the forward and backward computations.
The method of the following embodiments is a new binarization quantization method called high-order residual quantization (HORQ). It binarizes both the input and the weights. First, a series of binary input data of different scales is obtained; then convolution operations are performed on the binary input data of different scales and the results are combined. This approach successfully reduces information loss.
The method specifically comprises the following steps:
a first order residual is calculated and then a new round of thresholding operation is performed to further approximate the first order residual. The binarized version approximation of the residual can be considered a higher order binary input. The above operations are performed recursively and finally a series of binarized residual tensors corresponding to different quantization scales can be obtained as binary input data. Based on these binary input data, an efficient binarization filtering operation for forward and backward computations is developed.
The input to a convolutional layer is a four-dimensional tensor. If the input tensor and the corresponding weight filters are reshaped into matrices, the convolution operation can be regarded as a matrix multiplication, and the computation of each element of that multiplication can be regarded as a vector operation. Consider first the case where the input is a one-dimensional vector: suppose the input data is a vector X ∈ Rn, which can be quantized as X ≈ β1H1 with H1 ∈ {+1, -1}n. The result is then obtained by solving the following optimization problem:
J(β1, H1) = ||X − β1H1||², β1*, H1* = argmin J(β1, H1)
The solution to this problem is:
H1* = sign(X), β1* = ||X||l1 / n
wherein β1 is a real number. The above problem can be considered a first-order binarization quantization problem, and a first-order residual tensor R1(X) can be defined by calculating the difference between the actual input and the result of the first-order binarization quantization:
R1(X) = X − β1H1
After the above parameters are determined, R1(X) can be used to represent the information loss due to the approximation.
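As an illustrative sketch only (not part of the original patent text), the first-order quantization and residual described above can be written in a few lines of Python/NumPy; the function name first_order_quantize and the toy vector are assumptions made purely for illustration.

```python
import numpy as np

def first_order_quantize(x):
    """First-order binary quantization X ~= beta1 * H1, with
    H1 = sign(X) and beta1 = ||X||_l1 / n, plus the residual R1(X)."""
    h1 = np.where(x >= 0, 1.0, -1.0)   # sign(X), mapping zeros to +1
    beta1 = np.abs(x).mean()           # ||X||_l1 / n
    r1 = x - beta1 * h1                # R1(X), the information loss
    return beta1, h1, r1

x = np.array([0.7, -1.2, 0.1, -0.4])
beta1, h1, r1 = first_order_quantize(x)
print(beta1, h1, r1)                   # beta1 = 0.6, H1 = [+1, -1, +1, -1]
```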
Since R1(X) is a real-valued tensor, R1(X) can be further quantized as follows:
R1(X) ≈ β2H2
wherein β2 is a real number and H2 ∈ {+1, -1}n.
A second-order residual quantization of the input data can then be obtained:
X = β1H1 + R1(X) ≈ β1H1 + β2H2
wherein β1 and β2 are real scalars, H1 and H2 are binarized residual tensors, β1H1 is called the first-order binary input tensor, and β2H2 is referred to as the second-order binarized input tensor. The above problem can be solved in the same way as the previous one.
First, the corresponding optimization problem is solved:
J(β2, H2) = ||R1(X) − β2H2||²
The solution of the above equation is:
H2* = sign(R1(X)), β2* = ||R1(X)||l1 / n
Here β1*, H1* are the optimal values of β1, H1; for arbitrary β1, H1 the corresponding R1(X) can be calculated, and the corresponding R2(X) can be calculated in the same way; in the following, the optimal values are used.
The binary approximation method provided by the embodiment of the invention has better theoretical and practical effects than the existing traditional method.
The information loss of the first- and second-order approximation methods can be compared. In the conventional first-order binarization method, R1(X) is defined as the residual approximation tensor. Similarly, the second-order residual tensor in the newly proposed method is defined as R2(X) = R1(X) − β2H2. In the optimization process, the following inequality can be obtained:
||R2(X)||² ≤ ||R1(X)||²
Therefore, if the L2 norm of the residual tensor is used to represent the information loss, it can be demonstrated that the second-order residual quantization method proposed in this embodiment reduces the amount of information loss.
Further, the second-order residual quantization method can be extended to K-order residual quantization:
X ≈ β1H1 + β2H2 + … + βKHK
wherein Hi = sign(Ri−1(X)), βi = ||Ri−1(X)||l1 / n, Ri(X) = Ri−1(X) − βiHi, and R0(X) = X.
Higher-order input tensors can be obtained by recursively calculating the residual tensor. In fact, a higher order means less information loss but a much larger amount of computation. It turns out that second- and third-order residual quantization are already good enough. This embodiment mainly uses second-order residual quantization, which keeps the amount of computation appropriate while guaranteeing the quality of the residual quantization.
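A minimal sketch of the recursive K-order residual quantization described above, again as an illustration rather than the embodiment itself; the helper name horq_quantize and the random test vector are assumed. Printing the squared L2 norms of the residuals shows the information loss shrinking as the order grows, consistent with the inequality above.

```python
import numpy as np

def horq_quantize(x, order):
    """K-order residual quantization: X ~= sum_i beta_i * H_i, where each
    (beta_i, H_i) is the first-order binarization of the previous residual."""
    residual = np.asarray(x, dtype=np.float64)
    betas, hs = [], []
    for _ in range(order):
        h = np.where(residual >= 0, 1.0, -1.0)   # H_i = sign(R_{i-1}(X))
        beta = np.abs(residual).mean()           # beta_i = ||R_{i-1}(X)||_l1 / n
        residual = residual - beta * h           # R_i(X) = R_{i-1}(X) - beta_i * H_i
        betas.append(beta)
        hs.append(h)
    return betas, hs, residual

x = np.random.randn(1000)
for k in (1, 2, 3):
    _, _, r = horq_quantize(x, k)
    print(k, np.linalg.norm(r) ** 2)   # the information loss shrinks as the order grows
```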
The HORQ binarized input is then used for high-order binarized filtering in the forward and backward computations. The first step is the tensor reshaping process shown in fig. 1. Assume the binary input data is a tensor matrix X of dimension cin × win × hin and the convolution filter W has dimension cout × cin × w × h. The weight tensor W is divided into cout filters, each of which can be reshaped to 1 × (cin × w × h), so that the entire matrix W is reshaped into Wr of dimension cout × (cin × w × h). The output of the convolutional layer is denoted by Y, whose dimension is cout × wout × hout. For the input tensor X there are in total hout × wout sub-tensors, each of which is reshaped into the same dimension as a filter, so that the reshaped Xr has dimension (cin × w × h) × (wout × hout). The matrix multiplication Yr = WrXr then represents the result of the convolution operation. Finally, Yr is reshaped back to Y to complete the entire reshaping calculation process.
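The reshaping step can be illustrated with the following NumPy sketch, which assumes stride 1, no padding, and a (cin, hin, win) memory layout; these choices are not fixed by the patent and are made only to keep the example short.

```python
import numpy as np

def reshape_inputs(X, W):
    """Reshape input X (cin, hin, win) and weights W (cout, cin, h, w)
    into Wr of shape (cout, cin*h*w) and Xr of shape (cin*h*w, hout*wout),
    so the convolution becomes the matrix product Yr = Wr @ Xr."""
    cin, hin, win = X.shape
    cout, _, h, w = W.shape
    hout, wout = hin - h + 1, win - w + 1
    Wr = W.reshape(cout, cin * h * w)
    cols = []
    for i in range(hout):
        for j in range(wout):
            cols.append(X[:, i:i + h, j:j + w].reshape(-1))   # one sub-tensor per output pixel
    Xr = np.stack(cols, axis=1)                               # (cin*h*w, hout*wout)
    return Wr, Xr, (cout, hout, wout)

X = np.random.randn(3, 8, 8)
W = np.random.randn(4, 3, 3, 3)
Wr, Xr, out_shape = reshape_inputs(X, W)
Y = (Wr @ Xr).reshape(out_shape)    # Yr reshaped back to Y, as in the text above
print(Y.shape)                      # (4, 6, 6)
```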
After the tensor reshaping process, the next stage is a convolution computation using second-order residual quantization, which calculates the matrix product of Xr and Wr. Wr is first quantized as:
Wr(i) ≈ αiBi
wherein Wr(i) is row i of Wr. Then the input matrix Xr is quantized with second-order residual quantization:
Xr(i) ≈ β1(i)H1(i) + β2(i)H2(i)
wherein Xr(i) is row i of Xr. The above binarized convolution calculation can be carried out with the algorithm in fig. 2.
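The quantized matrix product can be sketched as follows. The patent text indexes Xr by rows; in this sketch each receptive-field sub-tensor (a column of Xr in the reshaping above) is quantized instead, since that is what enters each dot product of Yr = WrXr, so treat it as an interpretation under stated assumptions rather than the exact algorithm of fig. 2.

```python
import numpy as np

def binarize_rows(M):
    """First-order binary approximation of each row: M[i] ~= alpha_i * B_i."""
    B = np.where(M >= 0, 1.0, -1.0)
    alpha = np.abs(M).mean(axis=1)            # one scale per row
    return alpha, B

def horq_columns(M, order=2):
    """Order-K residual quantization of each column of M."""
    residual = M.copy()
    terms = []
    for _ in range(order):
        H = np.where(residual >= 0, 1.0, -1.0)
        beta = np.abs(residual).mean(axis=0)  # one scale per column
        residual = residual - H * beta
        terms.append((beta, H))
    return terms

def horq_matmul(Wr, Xr, order=2):
    """Approximate Yr = Wr @ Xr with binarized Wr rows and
    order-K residual-quantized Xr columns."""
    alpha, B = binarize_rows(Wr)
    Yr = np.zeros((Wr.shape[0], Xr.shape[1]))
    for beta, H in horq_columns(Xr, order):
        Yr += (B @ H) * beta[None, :]         # binary matrix product, then rescale
    return Yr * alpha[:, None]

Wr = np.random.randn(4, 27)
Xr = np.random.randn(27, 36)
print(np.abs(horq_matmul(Wr, Xr) - Wr @ Xr).mean())   # approximation error
```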
The process of training a HORQ network with the second-order residual quantization method of the above embodiment is shown in fig. 3. The usual steps are forward propagation, backward propagation and parameter update. During forward and backward propagation, the binarized representations of the inputs and of the weights are used. In fact, the high-order residual quantization method of the present invention can be conveniently applied in convolutional layers and fully connected layers. During training, the inputs and weights are quantized and the binarized convolution is computed layer by layer. After forward propagation, backward propagation is performed with the binarized weights and inputs, and the gradient is calculated using the sign function sign(·). The real values of the parameters and inputs are used when updating the parameters, because the magnitude of the parameter update is small at each iteration: if the binarized weights were updated directly, the update would vanish in the next binarization and the training of the network would be affected.
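A toy sketch of one training step in the spirit of fig. 3 is given below. The layer shape, the squared-error loss, the learning rate and the use of first-order (rather than second-order) input binarization are all assumptions made for brevity; the only point illustrated is that the forward/backward pass runs on binarized quantities while the update is applied to the real-valued weights.

```python
import numpy as np

# Toy training step for one fully connected layer: binarize the weights and the
# input for the forward/backward pass, but apply the update to the real-valued
# weights, otherwise the small per-step changes would be erased by re-binarization.
rng = np.random.default_rng(0)
W_real = rng.standard_normal((10, 64)) * 0.1        # real-valued master weights

def binarize(m):
    """First-order binarization: sign(m) scaled by the mean absolute value."""
    return np.abs(m).mean() * np.where(m >= 0, 1.0, -1.0)

def train_step(x, y, lr=0.01):
    global W_real
    w_bin = binarize(W_real)                        # binarized weights for this pass
    x_bin = binarize(x)                             # binarized input (first order here)
    logits = w_bin @ x_bin                          # forward pass
    grad_logits = logits - y                        # gradient of a squared-error loss
    grad_w = np.outer(grad_logits, x_bin)           # backward pass through the layer
    W_real -= lr * grad_w                           # update the real-valued weights

train_step(rng.standard_normal(64), rng.standard_normal(10))
```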
The high-performance network acceleration method based on the high-order residual quantization provided by the embodiment is an effective and accurate deep network acceleration method of a high-order binary approximation method. The concept of residual is referred to represent information loss and the residual of quantized input data of different sizes is recursively calculated to reduce information loss. By using the binarization weight and the binarization input, the size of the network is reduced to about 1/32 of the original size, and the training speed is improved by about 30 times. The method proposed by the above embodiment also provides the possibility to train a deep convolutional network on the CPU. The experimental result shows that the HORQ network provided by the embodiment has good classification effect and acceleration effect.
This is further described below in conjunction with specific examples.
Example experiments were performed on the MNIST and CIFAR-10 databases. The MNIST database is an image classification database containing handwritten digit images of 0-9. For comparison with other methods, the same MLP structure as the comparison methods was used in the experiments. This structure contains three hidden layers, a 4096-dimensional second-order residual-quantized connection structure and an L2-SVM output layer (using the hinge loss function). To train the MLP, no convolution processing, data preprocessing, data augmentation or pre-training was used. ADAM adaptive learning was used, and batch normalization with a batch size of 200 was applied to improve the training speed. An MLP with first-order binarized connections (XNOR) was also trained to compare its final test accuracy with that of the method of this embodiment. The two methods use the same network structure; the final results show that the method of this embodiment is 0.71% more accurate than the XNOR method and converges faster during training. Judging from the evolution of the hinge loss, the error decrease of the HORQ method appears smoother, although both methods can reduce the hinge loss to a relatively small value. Earlier work by other researchers used binarized weights with floating-point inputs, whereas the method of this embodiment uses binarized weights and binarized inputs. The experimental results on the MNIST database show that the method of this embodiment accelerates the deep network while keeping the loss of accuracy as small as possible, as shown in fig. 4 and fig. 5.
On the CIFAR-10 database, 50000 pictures were used for training and 10000 for testing. No data preprocessing or data augmentation was used in the experiments. To show the difference between the second-order residual quantization method of this embodiment and the conventional first-order binarization quantization method (XNOR), a convolutional neural network with a relatively small number of layers was used. The batch size was 50, which allows faster training, and the training data were normalized.
In the comparison of experimental results, the baseline results are inferior to some published results because the convolutional neural network used in the experiments is not very complex; however, a shallower network makes it easier to compare the method of this embodiment with XNOR. With the same network structure, the accuracy of this embodiment is about 5% higher than that of XNOR, with essentially the same convergence speed. This demonstrates the greater effectiveness of the method adopted in the embodiments of the present invention.
The computational cost of the method proposed in the embodiment of the present invention is analyzed below. For a convolutional neural network whose input has dimension cin × win × hin and whose convolution weight tensor matrix has dimension cout × cin × w × h, the total number of operations is cout × cin × wh × winhin. On an existing CPU that can process 64-bit binary data in one operation, the total number of operations required for the K-order residual quantization used in the embodiment of the present invention is:
K × cout × cin × wh × winhin + (K+1) × winhin = KNp + (K+1)Nn
Among these operations, the KNp binary-precision operations can be accelerated, while the remaining (K+1)Nn operations are floating-point operations and cannot be accelerated. The acceleration rate is therefore
γ = 64Np / (KNp + 64(K+1)Nn)
For the second-order case, the acceleration rate of the embodiment of the present invention is
γ = 64Np / (2Np + 192Nn)
As can be seen from the above equation, the acceleration rate is independent of the width and height of the input tensor, but depends on the filter size and the number of channels. First, the number of channels is fixed at cout × cin = 10 × 10 and the relationship between filter size and acceleration rate is observed; the results are shown in fig. 6. Then the filter size is fixed at w × h = 3 × 3 with 3 input channels, and the influence of the number of channels on the acceleration rate is observed; the result is shown in fig. 7. The experimental results show that if the number of channels and the filter size are too small, the acceleration effect is small, so when the binarization method of this embodiment is applied to a DCNN, network layers with too few channels should not be quantized. With cin × cout = 64 × 256 and w × h = 3 × 3, the second-order residual quantization method can be accelerated by a factor of 31.98. In practice, however, the acceleration rate may be slightly lower than this value due to limitations of the memory reads and data processing. The relationship between the quantization order and the acceleration rate is shown in fig. 8. The experimental results show that the second- and third-order residual quantization methods proposed in the embodiments of the present invention have a strong acceleration effect while maintaining accuracy. The complete framework of the invention is shown in fig. 9.
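The acceleration-rate analysis can be checked with a short calculation. The formula coded below (binary operations executed 64 at a time on a 64-bit CPU, floating-point operations not accelerated) is a reconstruction that reproduces the quoted factor of 31.98, so it should be read as an interpretation of the text rather than a quoted formula; the input size 32 × 32 is arbitrary, since the rate does not depend on it.

```python
def speedup(c_in, c_out, w, h, w_in, h_in, order):
    """Acceleration rate for order-K residual quantization:
    Np = c_out*c_in*w*h*w_in*h_in binary-capable ops (64 per cycle on a 64-bit CPU),
    Nn = w_in*h_in floating-point ops that cannot be accelerated."""
    n_p = c_out * c_in * w * h * w_in * h_in
    n_n = w_in * h_in
    return 64 * n_p / (order * n_p + 64 * (order + 1) * n_n)

print(round(speedup(64, 256, 3, 3, 32, 32, order=2), 2))   # ~31.98 for the quoted setting
```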
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (8)

1. A high-performance network acceleration method based on high-order residual quantization is characterized by comprising the following steps:
step S1, obtaining a series of binary input data with different scales through quantization and recursive operation;
step S2, carrying out convolution operation on binary input data with different scales, and combining the obtained operation results;
and step S3, training the convolutional neural network by using the result obtained in the step S2, and further completing the accelerated training of the convolutional neural network.
2. The high-performance network acceleration method based on high-order residual quantization of claim 1, characterized in that said step S1 comprises the following sub-steps:
step S11, calculating a first order residual error, and further approximating the first order residual error through thresholding operation;
in step S12, step S11 is recursively executed to obtain a series of binarized residual tensors corresponding to different quantization scales as binary input data.
3. The high-performance network acceleration method based on high-order residual quantization of claim 2, characterized in that said step S11 comprises the following sub-steps:
step S111, assuming an input data tensor matrix X, quantizing it by the following process to obtain a first-order residual of the input data tensor matrix X:
X ≈ β1H1
wherein β1 is a real number; H1 ∈ {+1, -1}n represents a first-order binary residual tensor; n is the dimension of tensor H1;
step S112, optimizing the quantization result in step S111 by minimizing
J(β1, H1) = ||X − β1H1||²
wherein J(·) represents a squared-error loss function; the obtained optimization result is the first-order binarization quantization result:
H1* = sign(X), β1* = ||X||l1 / n
wherein ||·||l1 is the l1 norm; the values β1 = β1*, H1 = H1* are used here;
step S113, calculating the difference between the actual input data X and the first-order binary quantization result β1H1 to define a first-order binarized residual tensor R1(X), which is then used for further approximation:
R1(X) = X − β1H1
R1(X) is used to indicate the information loss due to the approximation.
4. The high-performance network acceleration method based on high-order residual quantization of claim 3, characterized in that step S12 comprises the following sub-steps:
step S121, further quantizing R1(X), represented as follows:
R1(X) ≈ β2H2
wherein β2 is a real number; H2 ∈ {+1, -1}n represents a second-order binary residual tensor, n being the dimension of tensor H2;
obtaining a second-order residual quantization of the input data:
X = β1H1 + R1(X) ≈ β1H1 + β2H2
wherein β1 and β2 are respectively real scalars, H1 and H2 are respectively binary residual tensors, β1H1 is called the first-order binary input tensor, and β2H2 is referred to as the second-order binarized input tensor;
step S122, solving the optimization problem of the second-order binarization input tensor:
J(β2, H2) = ||R1(X) − β2H2||²
the obtained optimization results are as follows:
H2* = sign(R1(X)), β2* = ||R1(X)||l1 / n
the values β2 = β2*, H2 = H2* are used here; the second-order binarization residual tensor is further obtained as follows:
R2(X) = R1(X) − β2H2
in the optimization process, the following inequality is obtained:
||R2(X)||² ≤ ||R1(X)||²
where the L2 norm of the second-order binarized residual tensor is used to represent the information loss.
5. The high-performance network acceleration method based on high-order residual quantization of claim 4, characterized by further comprising the following step:
further quantizing the second-order binarized residual tensor R2(X) and continuing the residual quantization up to order K, to obtain the K-order residual quantization of the input data X:
X ≈ β1H1 + β2H2 + … + βKHK
wherein Hi = sign(Ri−1(X)), βi = ||Ri−1(X)||l1 / n, Ri(X) = Ri−1(X) − βiHi, and R0(X) = X.
6. the high-performance network acceleration method based on high-order residual quantization of claim 1, characterized in that said step S2 comprises the following sub-steps:
step S21, reshaping the binary input data;
in step S22, convolution calculation is performed using the reshaped binary input data.
7. The high-performance network acceleration method based on high-order residual quantization of claim 6, characterized in that said step S21 comprises the following sub-steps:
step S211, assuming that the binary input data is an input data tensor matrix X of dimension cin × win × hin, and that the convolution weight tensor matrix W has dimension cout × cin × w × h; W is divided into cout filters, each filter is reshaped to 1 × (cin × w × h), and the entire convolution weight tensor matrix W is reshaped into Wr, the reshaped Wr having dimension cout × (cin × w × h); the output of the convolutional layer is denoted by Y, and the dimension of Y is cout × wout × hout; for the input data tensor matrix X there are in total hout × wout sub-tensors, each of which is reshaped into the same dimension as a filter, so that the reshaped Xr has dimension (cin × w × h) × (wout × hout);
wherein cin represents the number of channels of the input tensor, win represents the width of each channel of the input tensor, hin represents the height of each channel of the input tensor, cout represents the number of convolution kernels, w represents the width of the convolution kernel, h represents the height of the convolution kernel, wout represents the width of each channel of the output tensor, and hout represents the height of each channel of the output tensor;
step S212, using the matrix multiplication Yr = WrXr to represent the calculation result of the convolution operation;
step S213, reshaping Yr back into Y to complete the entire reshaping calculation process.
8. The high-performance network acceleration method based on high-order residual quantization of claim 7, characterized in that said step S22 comprises the following sub-steps:
step S221, quantizing Wr:
Wr(i) ≈ αiBi
wherein Wr(i) is row i of Wr, αi represents a real constant, and Bi represents a first-order binary approximation of Wr(i);
step S222, quantizing Xr:
Xr(i) ≈ β1(i)H1(i)
wherein Xr(i) is row i of Xr;
step S223, carrying out the binarization convolution calculation using the quantized Wr and Xr.
CN201810604458.6A 2018-06-12 2018-06-12 High performance network accelerated method based on high-order residual quantization Pending CN108805286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810604458.6A CN108805286A (en) 2018-06-12 2018-06-12 High performance network accelerated method based on high-order residual quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810604458.6A CN108805286A (en) 2018-06-12 2018-06-12 High performance network accelerated method based on high-order residual quantization

Publications (1)

Publication Number Publication Date
CN108805286A true CN108805286A (en) 2018-11-13

Family

ID=64087017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810604458.6A Pending CN108805286A (en) 2018-06-12 2018-06-12 High performance network accelerated method based on high-order residual quantization

Country Status (1)

Country Link
CN (1) CN108805286A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260023A (en) * 2018-11-30 2020-06-09 罗伯特·博世有限公司 Bit interpretation for convolutional neural network input layer
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN111914986A (en) * 2019-05-10 2020-11-10 北京京东尚科信息技术有限公司 Method for determining binary convolution acceleration index and related equipment
WO2021044244A1 (en) * 2019-09-03 2021-03-11 International Business Machines Corporation Machine learning hardware having reduced precision parameter components for efficient parameter update
CN113420788A (en) * 2020-10-12 2021-09-21 黑芝麻智能科技(上海)有限公司 Integer-based fusion convolution layer in convolutional neural network and fusion convolution method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516129A (en) * 2017-08-01 2017-12-26 北京大学 The depth Web compression method decomposed based on the adaptive Tucker of dimension

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516129A (en) * 2017-08-01 2017-12-26 北京大学 The depth Web compression method decomposed based on the adaptive Tucker of dimension

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZEFAN LI, BINGBING NI et al.: "Performance Guaranteed Network Acceleration via High-Order Residual Quantization", ICCV 2017 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260023A (en) * 2018-11-30 2020-06-09 罗伯特·博世有限公司 Bit interpretation for convolutional neural network input layer
CN111340201A (en) * 2018-12-19 2020-06-26 北京地平线机器人技术研发有限公司 Convolutional neural network accelerator and method for performing convolutional operation thereof
CN111914986A (en) * 2019-05-10 2020-11-10 北京京东尚科信息技术有限公司 Method for determining binary convolution acceleration index and related equipment
WO2021044244A1 (en) * 2019-09-03 2021-03-11 International Business Machines Corporation Machine learning hardware having reduced precision parameter components for efficient parameter update
GB2600871A (en) * 2019-09-03 2022-05-11 Ibm Machine learning hardware having reduced precision parameter components for efficient parameter update
CN113420788A (en) * 2020-10-12 2021-09-21 黑芝麻智能科技(上海)有限公司 Integer-based fusion convolution layer in convolutional neural network and fusion convolution method

Similar Documents

Publication Publication Date Title
CN108805286A (en) High performance network accelerated method based on high-order residual quantization
US20210089922A1 (en) Joint pruning and quantization scheme for deep neural networks
CN108985317B (en) Image classification method based on separable convolution and attention mechanism
Wu et al. Easyquant: Post-training quantization via scale optimization
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
CN110334580A (en) The equipment fault classification method of changeable weight combination based on integrated increment
CN109146000B (en) Method and device for improving convolutional neural network based on freezing weight
CN114241779B (en) Short-time prediction method, computer and storage medium for urban expressway traffic flow
Nakandala et al. Incremental and approximate inference for faster occlusion-based deep cnn explanations
Aaron et al. Dynamic incremental k-means clustering
CN108805257A (en) A kind of neural network quantization method based on parameter norm
CN113256508A (en) Improved wavelet transform and convolution neural network image denoising method
CN111539444A (en) Gaussian mixture model method for modified mode recognition and statistical modeling
CN111985825A (en) Crystal face quality evaluation method for roller mill orientation instrument
Zhou et al. Online filter weakening and pruning for efficient convnets
CN108805844B (en) Lightweight regression network construction method based on prior filtering
CN111476346A (en) Deep learning network architecture based on Newton conjugate gradient method
Chikin et al. Channel balancing for accurate quantization of winograd convolutions
CN110619311A (en) Data classification method based on EEMD-ICA-SVM
CN113780550A (en) Convolutional neural network pruning method and device for quantizing feature map similarity
CN108305219B (en) Image denoising method based on irrelevant sparse dictionary
US20230153624A1 (en) Learning method and system for object tracking based on hybrid neural network
CN110837853A (en) Rapid classification model construction method
CN110288002A (en) A kind of image classification method based on sparse Orthogonal Neural Network
CN114492786A (en) Visual transform pruning method for alternative direction multipliers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181113