CN109190759A - Neural network model compression and acceleration method based on {-1, +1} coding - Google Patents

Neural network model compression and acceleration method based on {-1, +1} coding

Info

Publication number
CN109190759A
Authority
CN
China
Prior art keywords
neural network
bit
encoding
coding
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810866365.0A
Other languages
Chinese (zh)
Inventor
孙其功
焦李成
杨康
尚凡华
李秀芳
侯彪
杨淑媛
李玲玲
郭雨薇
唐旭
冯志玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810866365.0A priority Critical patent/CN109190759A/en
Publication of CN109190759A publication Critical patent/CN109190759A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a neural network compression and acceleration method based on {-1, +1} encoding, which mainly solves the problem that in the prior art deep neural network models cannot be deployed well on mobile phones and small edge devices. The method first constructs a neural network model and determines the number of {-1, 1} encoding bits; then retrains the neural network model parameters and quantizes them; re-encodes the quantized model parameters with {-1, 1}; adds encoding layers to the neural network model; and finally replaces the matrix and vector multiplication operations inside the neural network with several binary bit operations. The invention re-encodes the neural network parameters and activation values with {-1, 1}, which reduces the storage space of the model parameters and accelerates model computation.

Description

Neural network model compression and acceleration method based on {-1, +1} coding
Technical Field
The invention belongs to the technical field of deep neural networks, and particularly relates to a neural network model compression and acceleration method based on {-1, +1} coding, which is applied to the compression and acceleration of deep neural network models.
Background
Deep neural networks have made great breakthroughs in fields such as image classification, target detection, and natural language processing. In computer vision tasks, convolutional neural networks often yield results superior to other machine learning methods. However, deep neural networks are very deep and have a large number of parameters, so they can generally only run on GPUs with large video memory and strong computing power, and they perform poorly on mobile phones and small embedded devices with little memory and weak computing power. Quantizing and accelerating the deep neural network model is one way to solve this problem.
Existing binarized-parameter neural network methods have a drawback: classification tasks on large datasets involve many data features, and target detection additionally involves regression, but the feature representation capability after parameter binarization is insufficient, so these methods perform poorly on large-dataset classification and on target detection tasks.
Existing acceleration and compression methods for deep convolutional neural networks apply several sub-codebooks and their corresponding indexes, which makes the network parameters sparse and reduces the feature extraction capability of the original network; they are therefore difficult to apply to mobile phones and small embedded devices, and their acceleration ratio is limited.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a neural network model compression and acceleration method based on {-1, +1} coding. The method quantizes the deep neural network parameters and activation values, performs {-1, 1} encoding on the quantized parameters and activation values, and decomposes the matrix and vector multiplication operations inside the encoded neural network into several binary bit operations, thereby improving the utilization of hardware resources and realizing parameter compression and acceleration. By selecting the number of encoding bits, the method realizes {-1, 1} encoding, computation, and storage of data at different precisions.
The invention adopts the following technical scheme:
A neural network model compression and acceleration method based on {-1, +1} coding, characterized by: first constructing a neural network model; then segmenting [-1, 1] according to the precision required by the specific task, and determining from the number of segments the number of bits M used to re-encode the model parameters and activation values with {-1, 1}; initializing the neural network model and quantizing the model parameters; training the neural network model while quantizing the model parameters and activation values with a linear quantization formula according to the number M of {-1, 1} encoding bits; performing M-bit {-1, 1} encoding on the trained, quantized neural network model parameters; adding after each activation layer of the neural network model an encoding layer that re-encodes the activation values with M-bit {-1, 1} codes; and replacing the matrix and vector multiplication operations inside the neural network with several binary bit operations, thereby realizing model compression and acceleration.
Specifically, the activation function of the neural network model is the activation function HReLU, defined as follows:

$$\mathrm{HReLU}(x)=\begin{cases}0, & x<0\\ x, & 0\le x\le 1\\ 1, & x>1\end{cases}$$

where HReLU(x) represents the activation value obtained by taking x as input.
Specifically, the segmentation processing is as follows: divide [-1, 1] into 2^M segments, and determine from the number of segments 2^M the number of bits M used to subsequently re-encode the model parameters and activation values with {-1, 1}.
Specifically, the weights of the neural network model are initialized with mean 0 and variance 1, and the model parameters are quantized with a linear quantization formula according to the number M of {-1, 1} encoding bits.
Further, the linear quantization formula is as follows:

$$q_M(x)=\frac{2}{2^M-1}\left\langle\frac{(2^M-1)(x+1)}{2}\right\rangle-1$$

where q_M(x) denotes the quantized parameter value obtained under M-bit {-1, 1} coding, x denotes the full-precision parameter value, M denotes the number of {-1, 1} encoding bits, and ⟨·⟩ denotes the rounding operation.
Specifically, the trained, quantized neural network model parameters are subjected to M-bit {-1, 1} encoding with the encoding formula MBitEncoder, which is as follows:

$$\mathrm{MBitEncoder}(x)=[s_M\ s_{M-1}\ \cdots\ s_1]\quad\text{with}\quad x=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}s_m,\quad s_m\in\{-1,1\}$$

where MBitEncoder(x) denotes the M-bit {-1, 1} encoding of x, M denotes the total number of {-1, 1} encoding bits, m denotes the m-th bit of the M-bit {-1, 1} code, and s_m represents the value determined for the m-th bit of the M-bit {-1, 1} code, with s_m ∈ {-1, 1}.
further, the quantized value x is quantizedqAfter M-bit { -1,1} coding, it can be expressed as:
wherein x isqmDenotes xqM denotes the total number of bits of the coded { -1,1} code, xqEach bit can be represented as:
xq→[xqMxq(M-1)xq(M-2)...xq1]
wherein, the [ alpha ], [ beta ]]Denotes xqInner coding form of (1), xqMDenotes xqInner-1, 1 encodes the state of the Mth bit and xqM∈{-1,1}。
Specifically, the matrix and vector multiplication operations inside the neural network are replaced with several binary bit operations: each bit of the encoded parameters and activation values is separated out, and the multiplication between the separated model parameters and activation values is decomposed into several binary bit computations.
Further, the inner product of the encoded activation value vector x and the encoded neural network parameter vector w is represented as:

$$x^Tw=\frac{1}{(2^M-1)(2^K-1)}\sum_{m=1}^{M}\sum_{k=1}^{K}2^{m-1}\,2^{k-1}\left(\tilde{x}_m^T\tilde{w}_k\right)$$

where x^T w is decomposed into M·K binary bit operations; \tilde{x}_m denotes the vector formed by the m-th bits of the elements of x, and \tilde{w}_k the vector formed by the k-th bits of the elements of w; \tilde{x}_m^T \tilde{w}_k denotes performing an XNOR on the corresponding bits of \tilde{x}_m and \tilde{w}_k and counting the number of +1s minus the number of -1s in the result; M is the total number of activation-value encoding bits, and K is the total number of network-parameter encoding bits.
Further, each element of the encoded activation value vector x is represented with {-1, 1} encoding as follows:

$$x_i=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}x_i^m,\quad x_i^m\in\{-1,1\}$$

where x_i^m denotes the value of the m-th bit of element x_i, with x_i^m ∈ {-1, 1}.
and (3) transforming the vector x after the { -1,1} coding as follows:
wherein,the ith bit of each element in the vector is taken out to form a vector, and the encoded neural network parameter vector w is also transformed as the vector x
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a neural network model compression and acceleration method based on { -1, +1} coding, which carries out { -1,1} coding on model parameters and activation values of a neural network, the coding bit number can be freely selected, and the internal matrix or vector multiplication operation of the neural network is decomposed into a plurality of binary bit operations, thereby overcoming the problem that the model parameters can only be coded under single precision after being quantized in the prior art, so that the invention compresses and accelerates the model parameters under different coding precisions and has higher model acceleration ratio on an FPGA (field programmable gate array), and realizes the encoding, calculation and storage of different { -1,1} bits of data under different precisions by selecting the coding bit number, and the invention is not only competent for classification tasks, but also can be used for target detection tasks.
Furthermore, because the new activation function HReLU constrains the activation values to [0, 1] before encoding, the prior-art problems that a quantized model is difficult to converge and insensitive to full-precision weights are solved; the quantized neural network built by the invention can therefore be initialized directly with weight parameters trained on the full-precision neural network, which accelerates the convergence of quantized network training.
Furthermore, the invention uses the encoding formula MBitEncoder to perform {-1, 1} encoding on the model parameters and activation values, with a freely selectable number of encoding bits, and decomposes the matrix and vector multiplication operations inside the encoded neural network into several binary bit operations. This overcomes the prior-art limitation that quantized model parameters can only be encoded at a single precision, allowing the invention to compress the model parameters and accelerate the model at different encoding precisions, with a higher model acceleration ratio on FPGA.
Furthermore, since the number of {-1, 1} encoding bits for the model parameters and activation values is variable, the model's expressive capability can be improved by changing the number of encoding bits. This solves the prior-art problem of poor classification performance of binary networks on the large-scale classification dataset ImageNet, and makes the method applicable to neural networks for target detection tasks.
In conclusion, the method re-encodes the neural network parameters and activation values with {-1, 1}, reducing the storage space of the model parameters and accelerating model computation.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of the 2-bit encoded neural network fully connected algorithm of the present invention.
Detailed Description
The invention provides a neural network model compression and acceleration method based on {-1, +1} coding: first, a neural network model is constructed; then [-1, 1] is segmented according to the precision required by the specific task, and the number of bits M used to re-encode the model parameters and activation values with {-1, 1} is determined from the number of segments; the neural network model is initialized and the model parameters are quantized with a linear quantization formula; the neural network model is trained while the model parameters and activation values are quantized with the linear quantization formula according to the number M of {-1, 1} encoding bits; the trained, quantized model parameters are subjected to M-bit {-1, 1} encoding with the encoding formula defined by the invention; after each activation layer of the neural network model, an encoding layer that re-encodes the activation values with M-bit {-1, 1} codes is added; and the matrix and vector multiplication operations inside the neural network are replaced with several binary bit operations, realizing model compression and acceleration.
Referring to fig. 1, a neural network model compression and acceleration method based on { -1, +1} coding of the present invention includes the following steps:
S1, constructing a neural network model:
constructing the required neural network model according to the specific task; the neural network model generally comprises convolutional layers, pooling layers, fully connected layers, and the like; the activation function of the model is either the activation function HTanh or the activation function HReLU defined by the invention;
The activation function HTanh is formulated as follows:

$$\mathrm{HTanh}(x)=\begin{cases}-1, & x<-1\\ x, & -1\le x\le 1\\ 1, & x>1\end{cases}$$

where HTanh(x) represents the activation value obtained with x as input.
The activation function HReLU defined by the invention is formulated as follows:

$$\mathrm{HReLU}(x)=\begin{cases}0, & x<0\\ x, & 0\le x\le 1\\ 1, & x>1\end{cases}$$

where HReLU(x) represents the activation value obtained with x as input.
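As an illustration, here is a minimal NumPy sketch consistent with the two activation-function definitions above (the patent itself specifies no code; the function names are illustrative):

```python
import numpy as np

def htanh(x):
    # Hard tanh: clips the input to [-1, 1].
    return np.clip(x, -1.0, 1.0)

def hrelu(x):
    # HReLU: clips the input to [0, 1], keeping activation values in the
    # range that is later quantized and {-1, 1}-encoded.
    return np.clip(x, 0.0, 1.0)

print(htanh(np.array([-2.0, 0.3, 1.5])))  # [-1.   0.3  1. ]
print(hrelu(np.array([-2.0, 0.3, 1.5])))  # [0.   0.3  1. ]
```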
S2, segmenting [-1, 1] according to the precision required by the specific task: divide [-1, 1] into 2^M segments, and determine from the number of segments 2^M the number of bits M used to subsequently re-encode the model parameters and activation values with {-1, 1};
For example, if [-1, 1] is divided into 4 segments ([-1, -0.5), [-0.5, 0), [0, 0.5), and [0.5, 1]), then 2^M = 4 and the number of {-1, 1} encoding bits is M = 2;
S3, initializing the weights of the neural network model with mean 0 and variance 1, and quantizing the model parameters with a linear quantization formula according to the number M of {-1, 1} encoding bits;
Different numbers of {-1, 1} encoding bits produce quantizations of different precision: the more encoding bits, the higher the quantization precision.
For example, quantizing with a {-1, 1} encoding bit number of 1 yields x_q1 ∈ {-1, 1}, where x_q1 = -1 represents values in [-1, 0) and x_q1 = 1 represents values in [0, 1]. Quantizing with a {-1, 1} encoding bit number of 2 yields x_q2 ∈ {-1, -1/3, 1/3, 1};
The linear quantization formula used by the model is as follows:

$$q_M(x)=\frac{2}{2^M-1}\left\langle\frac{(2^M-1)(x+1)}{2}\right\rangle-1$$

where q_M(x) denotes the quantized parameter value obtained under M-bit {-1, 1} coding, x denotes the full-precision parameter value, M denotes the number of {-1, 1} encoding bits, and ⟨·⟩ denotes the rounding operation.
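A minimal NumPy sketch of this linear quantization, assuming the formula reconstructed above; the function name quantize_m_bit is illustrative, and rounding is done half-up (one possible reading of the ⟨·⟩ operation) so that 0 quantizes to +1, matching the 1-bit example above:

```python
import numpy as np

def quantize_m_bit(x, M):
    """Linearly quantize values in [-1, 1] onto the 2**M levels used for
    M-bit {-1, 1} encoding."""
    levels = 2 ** M - 1
    # Map [-1, 1] -> [0, levels], round half-up to the nearest level,
    # then map back to [-1, 1].
    return 2.0 * np.floor(levels * (np.asarray(x) + 1.0) / 2.0 + 0.5) / levels - 1.0

x = np.array([-0.9, -0.2, 0.1, 0.8])
print(quantize_m_bit(x, 1))  # [-1. -1.  1.  1.]
print(quantize_m_bit(x, 2))  # [-1.         -0.33333333  0.33333333  1.        ]
```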
S4, training the neural network model while quantizing the model parameters and activation values with the linear quantization formula of step S3, according to the number M of {-1, 1} encoding bits;
S5, performing M-bit {-1, 1} encoding on the trained, quantized neural network model parameters with the encoding formula defined by the invention;
for example, x quantized under two-bit { -1,1} codingq2E { -1, -1/3, 1/3, 1}, and the inner coding of the two-bit { -1,1} code is represented as { [ -1 } by re-encoding],[-11],[1-1],[11]};
The {-1, 1} encoding formula MBitEncoder defined by the invention is as follows:

$$\mathrm{MBitEncoder}(x)=[s_M\ s_{M-1}\ \cdots\ s_1]\quad\text{with}\quad x=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}s_m,\quad s_m\in\{-1,1\}$$

where MBitEncoder(x) denotes the M-bit {-1, 1} encoding of x, M denotes the total number of {-1, 1} encoding bits, m denotes the m-th bit of the M-bit {-1, 1} code, and s_m represents the value determined for the m-th bit, with s_m ∈ {-1, 1}.
for quantized value xqAfter M-bit { -1,1} coding, it can be expressed as:
wherein xqmDenotes xqM denotes the total number of bits of the coded { -1,1} code, xqEach bit can be represented as:
xq→[xqMxq(M-1)xq(M-2)...xq1](6)
wherein]Denotes xqInner coding form of (1), xqMDenotes xqInner-1, 1 encodes the state of the Mth bit and xqM∈{-1,1}。
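A sketch of the encoding step under the definition above, i.e. each quantized level equals (1/(2^M - 1)) Σ 2^(m-1) s_m with s_m ∈ {-1, 1}; the helper name mbit_encode is an assumption for illustration:

```python
def mbit_encode(xq, M):
    """Encode a quantized value xq into M bits [s_M, ..., s_1], s_m in {-1, 1},
    such that xq == sum(2**(m-1) * s_m for m in 1..M) / (2**M - 1)."""
    levels = 2 ** M - 1
    idx = int(round((xq + 1.0) * levels / 2.0))  # level index 0 .. 2**M - 1
    # Binary digits of idx, mapped from {0, 1} to {-1, 1}, high bit first.
    return [2 * ((idx >> (m - 1)) & 1) - 1 for m in range(M, 0, -1)]

# 2-bit example: {-1, -1/3, 1/3, 1} -> {[-1,-1], [-1,1], [1,-1], [1,1]}
for v in (-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0):
    print(v, mbit_encode(v, 2))
```

This reproduces the two-bit example above: -1/3 maps to the internal code [-1 1] and 1/3 to [1 -1].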
S6, adding after each activation layer of the neural network model an encoding layer that re-encodes the activation values with M-bit {-1, 1} codes; the encoding formula used by the encoding layer is the same as that of step S5;
S7, replacing the matrix and vector multiplication operations inside the neural network with several binary bit operations: each bit of the encoded parameters and activation values is separated out, and the multiplication between the separated model parameters and activation values is decomposed into several binary bit computations.
Without loss of generality, let the encoded activation value vector be x = [x_1, x_2, ..., x_N]^T and the encoded neural network parameter vector be w = [w_1, w_2, ..., w_N]^T.
The original inner product of x and w is expressed as follows:

$$x^Tw=\sum_{n=1}^{N}x_nw_n$$

where x^T denotes the transpose of x, x_n denotes the n-th element of x, w_n likewise denotes the n-th element of w, and x^T w denotes the inner-product summation of x and w.
Each element of the vector x and of the vector w is represented using the {-1, 1} encoding of the invention:

$$x_i=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}x_i^m,\quad x_i^m\in\{-1,1\}$$

where x_i^m denotes the value of the m-th bit of element x_i, with x_i^m ∈ {-1, 1}, and M denotes the total number of {-1, 1} encoding bits.
The {-1, 1}-encoded vector x is transformed as follows:

$$x=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}\tilde{x}_m,\quad \tilde{x}_m=[x_1^m\ x_2^m\ \cdots\ x_N^m]^T$$

where \tilde{x}_m denotes the vector formed by taking the m-th bit of each element of the vector; the vector w is transformed in the same way as the vector x.
Referring to fig. 2, x is a vector of full-precision real numbers; after quantization coding, x becomes a two-bit encoded number with each bit in {-1, 1}, and the full-precision real numbers of w are quantization-coded in the same form as x. The encoded x and w are decomposed, separating the high bits from the low bits: the low bits of x and the low bits of w are combined with an XNOR operation, with coefficient 2^0 · 2^0 = 1; the low bits of x and the high bits of w are combined with an XNOR operation, with coefficient 2^0 · 2^1 = 2; and so on. The trailing factor 1/9 is the normalization term 1/((2^2 - 1)(2^2 - 1)). The inner product of the vector x and the vector w is then represented as:

$$x^Tw=\frac{1}{(2^M-1)(2^K-1)}\sum_{m=1}^{M}\sum_{k=1}^{K}2^{m-1}\,2^{k-1}\left(\tilde{x}_m^T\tilde{w}_k\right)$$

where x^T w is decomposed into M·K binary bit operations; \tilde{x}_m^T \tilde{w}_k denotes performing an XNOR on the corresponding bits of \tilde{x}_m and \tilde{w}_k and counting the number of +1s minus the number of -1s in the result; M is the total number of activation-value encoding bits, and K is the total number of network-parameter encoding bits.
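A NumPy sketch of this decomposition, assuming the reconstructed formula above: the bit planes of x and w are extracted, combined pairwise by elementwise XNOR (the product of ±1 values), weighted by powers of two, and normalized. All names are illustrative:

```python
import numpy as np

def bit_planes(codes, n_bits):
    # codes: array of shape (N, n_bits) holding {-1, 1} encodings ordered
    # [s_M ... s_1]; plane m (m = 1..n_bits) carries weight 2**(m-1).
    return [codes[:, n_bits - m] for m in range(1, n_bits + 1)]

def decomposed_inner_product(x_bits, w_bits):
    # x_bits: (N, M) and w_bits: (N, K) arrays with {-1, 1} entries.
    M, K = x_bits.shape[1], w_bits.shape[1]
    xs, ws = bit_planes(x_bits, M), bit_planes(w_bits, K)
    total = 0.0
    for m in range(M):
        for k in range(K):
            # XNOR of two {-1, 1} vectors is their elementwise product;
            # its sum is (# of +1s) - (# of -1s), a popcount on hardware.
            total += (2 ** m) * (2 ** k) * np.sum(xs[m] * ws[k])
    return total / ((2 ** M - 1) * (2 ** K - 1))

# Verify against the direct inner product of the decoded vectors.
rng = np.random.default_rng(0)
M = K = 2
x_bits = rng.choice([-1, 1], size=(8, M))
w_bits = rng.choice([-1, 1], size=(8, K))
weights = 2 ** np.arange(M - 1, -1, -1)   # bit weights for [s_M ... s_1]
x = x_bits @ weights / (2 ** M - 1)       # decoded activation values
w = w_bits @ weights / (2 ** K - 1)       # decoded parameters
assert np.isclose(decomposed_inner_product(x_bits, w_bits), x @ w)
```

For M = K = 2 the normalization term is 1/9, matching the factor described for fig. 2.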
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Simulation conditions are as follows:
the hardware platform is as follows: intel (r) xeon (r) CPU Z480, 2.40GHz 16, 64G memory.
The software platform is: MXNet.
Simulation content and results:
the multi-bit coded resnet-18 classification model constructed by the method disclosed by the invention and the full-precision resnet-18 classification model constructed by the traditional method are respectively used for classifying the large-scale classification data set ImageNet.
The results are shown in table 1:
as can be seen from Table 1, the classification result of the 8-bit encoded resnet-18 classification model has higher accuracy than that of the full-precision resnet-18 classification model, and the parameters of the 8-bit encoded resnet-18 classification model are four times smaller and the classification speed is four times higher than that of the full-precision resnet-18 classification model. As the number of encoding bits decreases, the accuracy of classification decreases, but at the same time parameter compression and model acceleration increase.
A multi-bit coded SSD target detection model constructed by the method of the invention and a full-precision SSD target detection model constructed by the traditional method were used to perform target detection on images of the VOC target detection dataset.
The obtained target detection results are compared with the ground-truth target labels according to the following two formulas:
Recall = (number of correctly detected targets) / (total number of actual targets)
Precision = (number of correctly detected targets) / (total number of detected targets)
A precision-recall curve is drawn, the detection precision AP of target detection is obtained from the area under the curve, and the APs of the multiple categories are averaged to obtain the mean average precision mAP.
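As a sketch of how a precision-recall curve can be reduced to AP and mAP (simple trapezoidal area under the curve; this illustrates the computation and is not the patent's prescribed procedure, and the data below is a toy example):

```python
import numpy as np

def average_precision(recall, precision):
    # Area under a precision-recall curve, with recall sorted ascending.
    return float(np.trapz(precision, recall))

# Toy two-category example; real curves come from the detector's outputs.
curves = [
    (np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.8, 0.6])),
    (np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.7, 0.5])),
]
aps = [average_precision(r, p) for r, p in curves]
map_score = sum(aps) / len(aps)   # mean average precision (mAP)
print(aps, map_score)
```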
The results are shown in Table 2:

          Full precision   8-bit encoding   6-bit encoding   5-bit encoding
mAP       0.6392           0.6351           0.6111           0.4540
As can be seen from Table 2, although the mean average precision mAP of the 8-bit encoded SSD target detection model on the VOC dataset differs from that of the full-precision SSD model by only 0.0041, the 8-bit encoded SSD model compresses the model parameters and doubles the speed of target detection. As the number of encoding bits decreases, the SSD target detection model encoded by the method achieves ever greater parameter compression and model acceleration ratios.
In summary, the invention re-encodes and decomposes the parameters and activation values of a deep neural network, decomposing the neural network matrix operations into several binary bit operations and thereby realizing encoding, computation, and storage of data at different encoding precisions.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A neural network model compression and acceleration method based on {-1, +1} coding, characterized by: first constructing the neural network model; then segmenting [-1, 1] according to the precision required by the specific task, and determining from the number of segments the number of bits M used to re-encode the model parameters and activation values with {-1, 1}; initializing the neural network model and quantizing the model parameters; training the neural network model while quantizing the model parameters and activation values with a linear quantization formula according to the number M of {-1, 1} encoding bits; performing M-bit {-1, 1} encoding on the trained, quantized neural network model parameters; adding after each activation layer of the neural network model an encoding layer that re-encodes the activation values with M-bit {-1, 1} codes; and replacing the matrix and vector multiplication operations inside the neural network with several binary bit operations, thereby realizing model compression and acceleration.
2. The method as claimed in claim 1, wherein the activation function of the neural network model is the activation function HReLU, as follows:

$$\mathrm{HReLU}(x)=\begin{cases}0, & x<0\\ x, & 0\le x\le 1\\ 1, & x>1\end{cases}$$

where HReLU(x) represents the activation value obtained by taking x as input.
3. The method as claimed in claim 1, wherein the segmentation processing comprises: dividing [-1, 1] into 2^M segments, and determining from the number of segments 2^M the number of bits M used to subsequently re-encode the model parameters and activation values with {-1, 1}.
4. The method as claimed in claim 1, wherein the weights of the neural network model are initialized with mean 0 and variance 1, and the model parameters are quantized with a linear quantization formula according to the number M of {-1, 1} encoding bits.
5. The method as claimed in claim 4, wherein the linear quantization formula is as follows:

$$q_M(x)=\frac{2}{2^M-1}\left\langle\frac{(2^M-1)(x+1)}{2}\right\rangle-1$$

where q_M(x) denotes the quantized parameter value obtained under M-bit {-1, 1} coding, x denotes the full-precision parameter value, M denotes the number of {-1, 1} encoding bits, and ⟨·⟩ denotes the rounding operation.
6. The method as claimed in claim 1, wherein the trained, quantized neural network model parameters are subjected to M-bit {-1, 1} encoding with the encoding formula MBitEncoder, which is as follows:

$$\mathrm{MBitEncoder}(x)=[s_M\ s_{M-1}\ \cdots\ s_1]\quad\text{with}\quad x=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}s_m,\quad s_m\in\{-1,1\}$$

where MBitEncoder(x) denotes the M-bit {-1, 1} encoding of x, M denotes the total number of {-1, 1} encoding bits, m denotes the m-th bit of the M-bit {-1, 1} code, and s_m represents the value determined for the m-th bit, with s_m ∈ {-1, 1}.
7. The method as claimed in claim 6, wherein the quantized value x_q, after M-bit {-1, 1} encoding, can be expressed as:

$$x_q=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}x_{qm},\quad x_{qm}\in\{-1,1\}$$

where x_{qm} denotes the m-th bit of x_q and M denotes the total number of {-1, 1} encoding bits. The bits of x_q can be written as:

x_q → [x_{qM} x_{q(M-1)} x_{q(M-2)} ... x_{q1}]

where [·] denotes the internal encoded form of x_q, and x_{qM} denotes the state of the M-th bit of the internal {-1, 1} encoding of x_q, with x_{qM} ∈ {-1, 1}.
8. The method as claimed in claim 1, wherein the matrix and vector multiplication operations within the neural network are replaced with several binary bit operations: each bit of the encoded parameters and activation values is separated out, and the multiplication between the model parameters and activation values is decomposed into several binary bit computations.
9. The method as claimed in claim 8, wherein the inner product of the encoded activation value vector x and the encoded neural network parameter vector w is represented as:

$$x^Tw=\frac{1}{(2^M-1)(2^K-1)}\sum_{m=1}^{M}\sum_{k=1}^{K}2^{m-1}\,2^{k-1}\left(\tilde{x}_m^T\tilde{w}_k\right)$$

where x^T w is decomposed into M·K binary bit operations; \tilde{x}_m denotes the vector formed by the m-th bits of the elements of x, and \tilde{w}_k the vector formed by the k-th bits of the elements of w; \tilde{x}_m^T \tilde{w}_k denotes performing an XNOR on corresponding bits of \tilde{x}_m and \tilde{w}_k and counting the number of +1s minus the number of -1s in the result; M is the total number of activation-value encoding bits, and K is the total number of network-parameter encoding bits.
10. The method as claimed in claim 9, wherein each element of the encoded activation value vector x is represented with {-1, 1} encoding as follows:

$$x_i=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}x_i^m,\quad x_i^m\in\{-1,1\}$$

where x_i^m denotes the value of the m-th bit of element x_i, with x_i^m ∈ {-1, 1}.
The {-1, 1}-encoded vector x is transformed as follows:

$$x=\frac{1}{2^M-1}\sum_{m=1}^{M}2^{m-1}\tilde{x}_m,\quad \tilde{x}_m=[x_1^m\ x_2^m\ \cdots\ x_N^m]^T$$

where \tilde{x}_m denotes the vector formed by taking the m-th bit of each element of the vector; the encoded neural network parameter vector w is transformed in the same way as the vector x.
CN201810866365.0A 2018-08-01 2018-08-01 Neural network model compression and acceleration method based on {-1, +1} coding Pending CN109190759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810866365.0A CN109190759A (en) 2018-08-01 2018-08-01 Neural network model compression and acceleration method based on {-1, +1} coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810866365.0A CN109190759A (en) 2018-08-01 2018-08-01 Neural network model compression and acceleration method based on {-1, +1} coding

Publications (1)

Publication Number Publication Date
CN109190759A true CN109190759A (en) 2019-01-11

Family

ID=64920369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810866365.0A Pending CN109190759A (en) 2018-08-01 2018-08-01 Neural network model compression and acceleration method based on {-1, +1} coding

Country Status (1)

Country Link
CN (1) CN109190759A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886394A (en) * 2019-03-05 2019-06-14 北京时代拓灵科技有限公司 Three-valued neural networks weight processing method and processing device in embedded device
CN110069715A (en) * 2019-04-29 2019-07-30 腾讯科技(深圳)有限公司 A kind of method of information recommendation model training, the method and device of information recommendation
CN111723901A (en) * 2019-03-19 2020-09-29 百度在线网络技术(北京)有限公司 Training method and device of neural network model
CN111914986A (en) * 2019-05-10 2020-11-10 北京京东尚科信息技术有限公司 Method for determining binary convolution acceleration index and related equipment
CN113328755A (en) * 2021-05-11 2021-08-31 内蒙古工业大学 Compressed data transmission method facing edge calculation
WO2022147745A1 (en) * 2021-01-08 2022-07-14 深圳市大疆创新科技有限公司 Encoding method, decoding method, encoding apparatus, decoding apparatus
CN117391175A (en) * 2023-11-30 2024-01-12 中科南京智能技术研究院 Pulse neural network quantification method and system for brain-like computing platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133266A (en) * 2017-12-12 2018-06-08 北京信息科技大学 A kind of neural network weight compression method and application method based on non-uniform quantizing
CN108304919A (en) * 2018-01-29 2018-07-20 百度在线网络技术(北京)有限公司 Method and apparatus for generating convolutional neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133266A (en) * 2017-12-12 2018-06-08 北京信息科技大学 A kind of neural network weight compression method and application method based on non-uniform quantizing
CN108304919A (en) * 2018-01-29 2018-07-20 百度在线网络技术(北京)有限公司 Method and apparatus for generating convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ITAY HUBARA et al.: "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations", https://arxiv.org/abs/1609.07061 *
Synced (机器之心): "ICLR 2018 | Alibaba paper: Multi-bit quantization of recurrent neural networks based on the alternating direction method", https://www.sohu.com/a/229799206_129720 *
Mou Shuai (牟帅): "Research on acceleration and compression of deep neural networks based on bit quantization", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886394A (en) * 2019-03-05 2019-06-14 北京时代拓灵科技有限公司 Three-valued neural networks weight processing method and processing device in embedded device
CN109886394B (en) * 2019-03-05 2021-06-18 北京时代拓灵科技有限公司 Method and device for processing weight of ternary neural network in embedded equipment
CN111723901A (en) * 2019-03-19 2020-09-29 百度在线网络技术(北京)有限公司 Training method and device of neural network model
CN111723901B (en) * 2019-03-19 2024-01-12 百度在线网络技术(北京)有限公司 Training method and device for neural network model
CN110069715A (en) * 2019-04-29 2019-07-30 腾讯科技(深圳)有限公司 A kind of method of information recommendation model training, the method and device of information recommendation
CN110069715B (en) * 2019-04-29 2022-12-23 腾讯科技(深圳)有限公司 Information recommendation model training method, information recommendation method and device
CN111914986A (en) * 2019-05-10 2020-11-10 北京京东尚科信息技术有限公司 Method for determining binary convolution acceleration index and related equipment
WO2022147745A1 (en) * 2021-01-08 2022-07-14 深圳市大疆创新科技有限公司 Encoding method, decoding method, encoding apparatus, decoding apparatus
CN113328755A (en) * 2021-05-11 2021-08-31 内蒙古工业大学 Compressed data transmission method facing edge calculation
CN113328755B (en) * 2021-05-11 2022-09-16 内蒙古工业大学 Compressed data transmission method facing edge calculation
CN117391175A (en) * 2023-11-30 2024-01-12 中科南京智能技术研究院 Pulse neural network quantification method and system for brain-like computing platform

Similar Documents

Publication Publication Date Title
CN109190759A (en) Neural network model compression and acceleration method based on {-1, +1} coding
CN109889839B (en) Region-of-interest image coding and decoding system and method based on deep learning
CN105184362B (en) Acceleration and compression method for deep convolutional neural networks based on parameter quantization
CN108510083B (en) Neural network model compression method and device
CN107016708B (en) Image hash coding method based on deep learning
CN109002889B (en) Adaptive iterative convolution neural network model compression method
EP3735658A1 (en) Generating a compressed representation of a neural network with proficient inference speed and power consumption
WO2017031630A1 (en) Deep convolutional neural network acceleration and compression method based on parameter quantification
CN108304928A (en) Compression method based on the deep neural network for improving cluster
CN107493405A (en) Encrypted image reversible information hidden method based on coding compression
CN110517329A (en) A kind of deep learning method for compressing image based on semantic analysis
CN110248190B (en) Multilayer residual coefficient image coding method based on compressed sensing
CN110753225A (en) Video compression method and device and terminal equipment
CN111179144B (en) Efficient information hiding method for multi-embedding of multi-system secret information
CN109523016B (en) Multi-valued quantization depth neural network compression method and system for embedded system
CN110647990A (en) Cutting method of deep convolutional neural network model based on grey correlation analysis
CN104392207A (en) Characteristic encoding method for recognizing digital image content
CN112817940A (en) Gradient compression-based federated learning data processing system
CN114301889A (en) Efficient federated learning method and system based on weight compression
CN107170020A (en) Dictionary learning still image compression method based on minimum quantization error criterion
CN106331719B (en) A kind of image data compression method split based on the Karhunen-Loeve transformation error space
CN109670057B (en) Progressive end-to-end depth feature quantization system and method
CN104320659B (en) Background modeling method, device and equipment
WO2022247368A1 (en) Methods, systems, and mediafor low-bit neural networks using bit shift operations
Seo et al. Hybrid approach for efficient quantization of weights in convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190111