CN110659725A - Neural network model compression and acceleration method, data processing method and device - Google Patents
- Publication number
- CN110659725A CN110659725A CN201910893276.XA CN201910893276A CN110659725A CN 110659725 A CN110659725 A CN 110659725A CN 201910893276 A CN201910893276 A CN 201910893276A CN 110659725 A CN110659725 A CN 110659725A
- Authority
- CN
- China
- Prior art keywords
- linear layer
- quantization
- layer
- parameters
- neural network
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A compression and acceleration method of a neural network model, a data processing method and device and a storage medium are provided. The neural network model comprises a linear layer, and parameters of the neural network model comprise preparation weight parameters; the compression and acceleration method comprises the following steps: quantizing parameters of the neural network model to obtain a quantized model, wherein the parameters of the quantized model comprise quantized weight parameters of a linear layer; and carrying out scale transformation processing on the quantization model to obtain a target quantization model. Carrying out scale transformation processing on the quantization model, wherein the scale transformation processing comprises the following steps: calculating scale transformation parameters of the linear layer based on the number of output neurons of the linear layer or the standard deviation of the preparation weight parameters of the linear layer; and carrying out scale transformation processing on the quantization weight parameter of the linear layer based on the scale transformation parameter of the linear layer to obtain a standard quantization weight parameter of the linear layer.
Description
Technical Field
The embodiment of the disclosure relates to a compression and acceleration method of a neural network model, a data processing method and device and a storage medium.
Background
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
Disclosure of Invention
At least one embodiment of the present disclosure provides a compression and acceleration method of a neural network model, the neural network model including a linear layer, parameters of the neural network model including preparatory weight parameters, the compression and acceleration method including: quantizing the parameters of the neural network model to obtain a quantization model, wherein the parameters of the quantization model comprise quantization weight parameters of the linear layer; carrying out scale transformation processing on the quantization model to obtain a target quantization model; wherein performing the scaling process on the quantization model comprises: calculating a scale transformation parameter of the linear layer based on the number of output neurons of the linear layer or a standard deviation of preparation weight parameters of the linear layer; and based on the scale transformation parameter of the linear layer, carrying out the scale transformation processing on the quantization weight parameter of the linear layer to obtain a standard quantization weight parameter of the linear layer.
For example, in the compression and acceleration methods provided in some embodiments of the present disclosure, the linear layer includes at least one selected from the group consisting of a convolutional layer, a recursive layer, and a fully-connected layer.
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, the linear layer is not directly followed by the batch normalization layer.
For example, in the compression and acceleration method provided by some embodiments of the present disclosure, quantizing parameters of the neural network model to obtain the quantization model includes: clamping the preparation weight parameter of the linear layer to obtain a clamping weight parameter of the linear layer; and carrying out quantization processing on the clamping weight parameters of the linear layer to obtain the quantization weight parameters of the linear layer.
For example, in the compression and acceleration method provided by some embodiments of the present disclosure, calculating the scale transformation parameter of the linear layer based on the number of output neurons of the linear layer includes: calculating the scale transformation parameter of the linear layer according to a first scale transformation parameter calculation formula, wherein the first scale transformation parameter calculation formula is expressed as:

RSF = 1/sqrt(N_out × VAR(Q)),

wherein RSF represents the scale transformation parameter of the linear layer, N_out represents the number of output neurons of the linear layer, Q represents the quantization weight matrix of the linear layer, and VAR(Q) represents the variance of the elements of the quantization weight matrix of the linear layer.
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, the number of bits of the quantization weight parameter of the linear layer is 1 to 8.
For example, in the compression and acceleration methods provided in some embodiments of the present disclosure, the number of bits of the quantization weight parameter of the linear layer is 1-2.
For example, in the compression and acceleration method provided by some embodiments of the present disclosure, calculating the scale transformation parameter of the linear layer based on the number of output neurons of the linear layer includes: calculating the scale transformation parameter of the linear layer according to a second scale transformation parameter calculation formula, wherein the second scale transformation parameter calculation formula is expressed as:

RSF = 1/sqrt(N_out × VAR(Wa)),

wherein RSF represents the scale transformation parameter of the linear layer, N_out represents the number of output neurons of the linear layer, Wa represents an auxiliary weight matrix of the linear layer, and VAR(Wa) represents the variance of the elements of the auxiliary weight matrix of the linear layer;
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, calculating the scale transformation parameter of the linear layer based on the standard deviation of the preparation weight parameters of the linear layer includes: calculating the scale transformation parameter of the linear layer according to a third scale transformation parameter calculation formula, wherein the third scale transformation parameter calculation formula is expressed as:

RSF = sqrt(VAR(W))/sqrt(VAR(Wa)),

wherein RSF represents the scale transformation parameter of the linear layer, W represents the preparation weight matrix of the linear layer, VAR(W) represents the variance of the elements of the preparation weight matrix of the linear layer, Wa represents an auxiliary weight matrix of the linear layer, and VAR(Wa) represents the variance of the elements of the auxiliary weight matrix of the linear layer;
For example, in the compression and acceleration methods provided in some embodiments of the present disclosure, the number of bits of the quantization weight parameter of the linear layer is 3 to 8.
For example, in the compression and acceleration methods provided in some embodiments of the present disclosure, performing the scale transformation processing on the quantization weight parameter of the linear layer based on the scale transformation parameter of the linear layer to obtain a standard quantization weight parameter of the linear layer includes: performing the scale transformation processing on the quantization weight parameters of the linear layer according to a scale transformation formula, wherein the scale transformation formula is expressed as:

Q*_ij = RSF × Q_ij,

wherein Q* represents the standard quantization weight matrix of the linear layer, Q*_ij represents the parameter in the ith row and jth column of the standard quantization weight matrix of the linear layer, Q represents the quantization weight matrix of the linear layer, and Q_ij represents the parameter in the ith row and jth column of the quantization weight matrix of the linear layer.
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, performing the clamping processing on the preparation weight parameters of the linear layer to obtain the clamping weight parameters of the linear layer includes: performing the clamping processing on the preparation weight parameters of the linear layer according to a clamping formula, wherein the clamping formula is expressed as:

Wc_ij = tanh(W_ij)/(2 × max(|tanh(W_mn)|)) + 1/2,

wherein Wc represents the clamping weight matrix of the linear layer, Wc_ij represents the parameter in the ith row and jth column of the clamping weight matrix, W represents the preparation weight matrix of the linear layer, W_ij represents the parameter in the ith row and jth column of the preparation weight matrix of the linear layer, W_mn represents the parameter in the mth row and nth column of the preparation weight matrix of the linear layer (the maximum is taken over all m and n), tanh(·) represents the hyperbolic tangent function, and max(·) represents the maximum-value function.
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, performing the quantization processing on the clamping weight parameters of the linear layer to obtain the quantization weight parameters of the linear layer includes: performing the quantization processing on the clamping weight parameters of the linear layer according to a quantization formula, wherein the quantization formula is expressed as:

Q_ij = 2 × round((2^b - 1) × Wc_ij)/(2^b - 1) - 1,

wherein Q represents the quantization weight matrix of the linear layer, Q_ij represents the parameter in the ith row and jth column of the quantization weight matrix of the linear layer, Wc_ij represents the parameter in the ith row and jth column of the clamping weight matrix of the linear layer, b represents the number of bits of the quantization weight parameter, and round(·) represents a rounding function.
For example, some embodiments of the present disclosure provide a compression and acceleration method, further including: and training the target quantization model by adopting the same training parameter configuration as the neural network model.
For example, in the compression and acceleration method provided by some embodiments of the present disclosure, the training process of the target quantization model includes: a forward propagation stage, a backward propagation stage and a standard quantization stage; the forward propagation phase comprises: processing training input data by using a current target quantization model to obtain training output data, and calculating a loss value based on the training output data; the back propagation phase comprises: calculating a gradient based on the loss value, and correcting parameters of the current neural network model based on the gradient to obtain an updated neural network model; the standard quantization stage comprises: quantizing parameters of the updated neural network model to obtain an updated quantization model, and performing scale transformation processing on the updated quantization model to obtain an updated target quantization model.
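As an illustration of the three training stages described above, the following is a minimal, self-contained sketch (not the reference implementation of this disclosure) that trains a toy fully-connected layer with quantization-aware updates. It assumes the clamping, quantization, and scale transformation formulas given above, 2-bit weights, synthetic regression data, and a straight-through estimator for the weight gradient; all function and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_and_rescale(W, n_out, b=2):
    # Step S100: clamp the preparation weights into [0, 1] and quantize to b bits,
    # then step S200: scale transformation based on the number of output neurons.
    t = np.tanh(W)
    Wc = t / (2.0 * np.max(np.abs(t))) + 0.5        # clamping weight matrix
    levels = 2 ** b - 1
    Q = 2.0 * np.round(levels * Wc) / levels - 1.0  # quantization weight matrix
    rsf = 1.0 / np.sqrt(n_out * np.var(Q))          # scale transformation parameter
    return rsf * Q                                  # standard quantization weights

# Toy fully-connected layer: 8 input features, 4 output neurons, synthetic targets.
W = rng.normal(0.0, 0.5, size=(4, 8))               # full-precision preparation weights
x = rng.normal(size=(64, 8))
y = x @ rng.normal(size=(8, 4))

lr = 0.05
for step in range(300):
    # Forward propagation stage: process training input data with the current
    # target quantization model and compute a loss value (mean squared error).
    Q_star = quantize_and_rescale(W, n_out=4)
    pred = x @ Q_star.T
    loss = np.mean((pred - y) ** 2)
    # Back propagation stage: compute the gradient from the loss value and correct
    # the parameters of the full-precision neural network model (straight-through:
    # the gradient with respect to Q_star is applied directly to W).
    grad = 2.0 * (pred - y).T @ x / x.shape[0]
    W -= lr * grad
    # Standard quantization stage: at the start of the next iteration, quantizing
    # and rescaling the updated W yields the updated target quantization model.

print("final training loss:", float(loss))
```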
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, the neural network model includes an activation layer that includes a PACT activation function, the PACT activation function being expressed as:

y = PACT(x) = 0.5 × (|x| - |x - α| + α),

wherein y represents the output of the activation layer, x represents the input of the activation layer, and α represents the activation value parameter of the PACT activation function;
quantizing the parameters of the neural network model to obtain the quantization model further comprises:
performing the quantization processing on the output of the activation layer according to an activation value quantization formula, the activation value quantization formula being expressed as:

q = round(y × (2^a - 1)/α) × α/(2^a - 1),

where q represents the quantized value of the output of the activation layer, y represents the output of the activation layer, a represents the number of bits of the quantized value of the output of the activation layer, and round(·) represents a rounding function.
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, the back propagation stage further includes: calculating an activation value gradient according to an activation value gradient formula, and correcting the current activation value parameter based on the activation value gradient to obtain an updated activation value parameter, wherein the activation value gradient formula is expressed as:

∂q/∂α = 0, if x < α; ∂q/∂α = 1, if x ≥ α.
For example, in the compression and acceleration method provided in some embodiments of the present disclosure, the training parameter configuration includes: an initial learning rate, a learning rate adjustment scheme, a weight decay, the number of iterations of the training set (epochs), an optimizer, and a batch size.
For example, in the compression and acceleration method provided in some embodiments of the present disclosure, before quantizing the parameters of the neural network model, the compression and acceleration method further includes: and pre-training the neural network model to obtain a preparation weight parameter of the neural network model.
For example, in the compression and acceleration method provided by some embodiments of the present disclosure, the pre-training of the neural network model includes: initializing the parameters of the neural network model using a Kaiming initialization scheme.
For example, in the compression and acceleration methods provided by some embodiments of the present disclosure, the neural network model includes one of ResNet, MobileNet-V1, MobileNet-V2, and VGG-Net.
At least one embodiment of the present disclosure further provides a data processing method, including: the target quantization model obtained by adopting the compression and acceleration method provided by any embodiment of the disclosure is used for processing input data.
At least one embodiment of the present disclosure also provides a data processing apparatus, including: a memory for non-transitory storage of computer readable instructions; and a processor for executing computer readable instructions; wherein the computer readable instructions, when executed by the processor, perform the compression and acceleration methods provided by any of the embodiments of the present disclosure or perform the data processing methods provided by any of the embodiments of the present disclosure.
At least one embodiment of the present disclosure also provides a storage medium that non-transitorily stores computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, can perform the compression and acceleration method provided by any embodiment of the present disclosure or the data processing method provided by any embodiment of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic diagram of a convolutional neural network;
FIG. 2A is a schematic diagram of a convolutional neural network;
FIG. 2B is a schematic diagram of the operation of a convolutional neural network;
FIG. 3 is a schematic diagram of another convolutional neural network;
fig. 4 is a flowchart of a method for compressing and accelerating a neural network model according to at least one embodiment of the present disclosure;
fig. 5 is an exemplary flowchart corresponding to step S100 shown in fig. 4 provided in at least one embodiment of the present disclosure;
fig. 6 is another exemplary flowchart corresponding to step S100 shown in fig. 4 provided in at least one embodiment of the present disclosure;
fig. 7 is an exemplary flowchart corresponding to step S200 shown in fig. 4 provided in at least one embodiment of the present disclosure;
fig. 8 is an exemplary flowchart corresponding to step S300 shown in fig. 4 provided in at least one embodiment of the present disclosure;
fig. 9 is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure; and
fig. 10 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The present disclosure is illustrated by the following specific examples. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components have been omitted from the present disclosure. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is represented by the same or a similar reference numeral in each drawing.
Among the algorithmic technologies in the AI field, deep learning has attracted wide attention from academia and industry, and scientists, researchers, enterprises, online communities, and others are all actively studying deep learning and promoting the research and development of deep learning neural network models.
With the breakthroughs and progress of deep learning in fields such as image classification, object detection, and natural language processing, the demand for applying deep learning to real-life scenarios has grown stronger. Currently, mobile and portable electronic devices greatly facilitate people's lives, and deep learning will greatly improve the intelligence and entertainment capabilities of these devices. Therefore, it is highly desirable to deploy deep learning neural network models on mobile terminals and in embedded systems.
However, in actual deployment, a neural network model applying deep learning generally faces the problem of an oversized model. For example, the file size of a neural network model typically ranges from tens of megabytes to hundreds of megabytes; downloading a file of this size to a mobile terminal consumes data traffic and, limited by bandwidth, causes a transmission wait that is too long for users to bear. For some embedded systems with limited storage space in particular, there may not be enough storage space at all to store such a large neural network model file. Meanwhile, a deep learning neural network model has high requirements on computing resources and computing power; when a large neural network model is used for computation, the mobile terminal or embedded system either cannot provide the required computing resources or computes too slowly, so that the response delay is too high to meet the needs of practical application scenarios. In addition, the neural network model also consumes a large amount of power. During computation, the processor needs to frequently read the parameters of the neural network model, so a larger neural network model brings correspondingly more memory accesses; frequent memory access also greatly increases power consumption, and high power consumption is not conducive to deploying the neural network model on a mobile terminal.
Therefore, in order to deploy a well-performing neural network on resource-limited hardware devices, the neural network model needs to be compressed and accelerated. Because a quantized model can be ported to hardware very conveniently, quantizing the neural network model has great development potential among the many methods for compressing and accelerating neural network models.
At least one embodiment of the present disclosure provides a method for compressing and accelerating a neural network model. The neural network model comprises a linear layer, and parameters of the neural network model comprise preparation weight parameters; the compression and acceleration method comprises the following steps: quantizing parameters of the neural network model to obtain a quantized model, wherein the parameters of the quantized model comprise quantized weight parameters of a linear layer; and carrying out scale transformation processing on the quantization model to obtain a target quantization model. Wherein, carrying out scale transformation processing on the quantization model comprises the following steps: calculating scale transformation parameters of the linear layer based on the number of output neurons of the linear layer or the standard deviation of the preparation weight parameters of the linear layer; and carrying out scale transformation processing on the quantization weight parameter of the linear layer based on the scale transformation parameter of the linear layer to obtain a standard quantization weight parameter of the linear layer.
Some embodiments of the present disclosure also provide a data processing method and apparatus, and a storage medium corresponding to the compression and acceleration method.
According to the compression and acceleration method of the neural network model, the target quantization model is obtained by carrying out scale transformation processing on the quantization model, the precision of the target quantization model can be improved, and the performance of the target quantization model is improved.
Originally, convolutional neural networks (CNNs) were primarily used to identify two-dimensional shapes, and they are highly invariant to translation, scaling, tilting, and other forms of image deformation. CNNs simplify the complexity of neural network models and reduce the number of weights mainly through local receptive fields and weight sharing. With the development of deep learning technology, the application range of CNNs is no longer limited to the field of image recognition; they can also be applied to fields such as face recognition, character recognition, animal classification, and image processing.
Fig. 1 shows a schematic diagram of a convolutional neural network. For example, the convolutional neural network may be used for image processing; it uses images as input and output and replaces scalar weights with convolution kernels. Only a convolutional neural network having a 3-layer structure is illustrated in Fig. 1, and embodiments of the present disclosure are not limited thereto. As shown in Fig. 1, the convolutional neural network includes an input layer 101, a hidden layer 102, and an output layer 103. The input layer 101 has 4 inputs, the hidden layer 102 has 3 outputs, the output layer 103 has 2 outputs, and the convolutional neural network finally outputs 2 images.
For example, the 4 inputs to the input layer 101 may be 4 images, or four feature images of 1 image. The 3 outputs of the hidden layer 102 may be feature images of the image input via the input layer 101.
For example, as shown in FIG. 1, the convolutional layers have weights w_ij^k and biases b_i^k. The weights w_ij^k represent convolution kernels, and the biases b_i^k are scalars superimposed on the outputs of the convolutional layers, where k is a label representing the input layer 101, and i and j are labels of the elements of the input layer 101 and the elements of the hidden layer 102, respectively. For example, the first convolutional layer 201 includes a first set of convolution kernels (w_ij^1 in FIG. 1) and a first set of biases (b_i^1 in FIG. 1). The second convolutional layer 202 includes a second set of convolution kernels (w_ij^2 in FIG. 1) and a second set of biases (b_i^2 in FIG. 1). Typically, each convolutional layer includes tens or hundreds of convolution kernels, and a deep convolutional neural network may include at least five convolutional layers.
For example, as shown in Fig. 1, the convolutional neural network further includes a first activation layer 203 and a second activation layer 204. The first activation layer 203 is located after the first convolutional layer 201, and the second activation layer 204 is located after the second convolutional layer 202. The activation layers (e.g., the first activation layer 203 and the second activation layer 204) include activation functions, which are used to introduce nonlinear factors into the convolutional neural network so that the convolutional neural network can better solve more complex problems. The activation function may include a rectified linear unit (ReLU) function, a sigmoid function, or a hyperbolic tangent function (tanh function), etc. The ReLU function is an unsaturated nonlinear function, while the sigmoid function and the tanh function are saturated nonlinear functions. For example, an activation layer may be a separate layer of the convolutional neural network, or an activation layer may be included in a convolutional layer (e.g., the first convolutional layer 201 may include the first activation layer 203, and the second convolutional layer 202 may include the second activation layer 204).
For example, in the first convolutional layer 201, first, several convolution kernels w_ij^1 of the first set of convolution kernels and several biases b_i^1 of the first set of biases are applied to each input to obtain the output of the first convolutional layer 201; the output of the first convolutional layer 201 can then be processed by the first activation layer 203 to obtain the output of the first activation layer 203. In the second convolutional layer 202, first, several convolution kernels w_ij^2 of the second set of convolution kernels and several biases b_i^2 of the second set of biases are applied to the output of the first activation layer 203, which serves as its input, to obtain the output of the second convolutional layer 202; the output of the second convolutional layer 202 may then be processed by the second activation layer 204 to obtain the output of the second activation layer 204. For example, the output of the first convolutional layer 201 may be the result of applying the convolution kernels w_ij^1 to its input and then adding the biases b_i^1, and the output of the second convolutional layer 202 may be the result of applying the convolution kernels w_ij^2 to the output of the first activation layer 203 and then adding the biases b_i^2.
Before image processing is performed by using the convolutional neural network, the convolutional neural network needs to be trained. After training, the convolution kernel and bias of the convolutional neural network remain unchanged during image processing. In the training process, each convolution kernel and bias are adjusted through a plurality of groups of input/output example images and an optimization algorithm to obtain an optimized convolution neural network model.
Fig. 2A shows a schematic structural diagram of a convolutional neural network, and fig. 2B shows a schematic operational process diagram of a convolutional neural network. For example, as shown in fig. 2A and 2B, after the input image is input to the convolutional neural network through the input layer, the class identifier is output after several processing procedures (e.g., each level in fig. 2A) are performed in sequence. The main components of a convolutional neural network may include a plurality of convolutional layers, a plurality of downsampling layers, and a fully-connected layer. For example, a complete convolutional neural network may be composed of a stack of these three layers. For example, fig. 2A shows only three levels of a convolutional neural network, namely a first level, a second level, and a third level. For example, each tier may include a convolution module and a downsampling layer. For example, each convolution module may include a convolution layer. Thus, the processing procedure of each hierarchy may include: the input image is convolved (convolution) and downsampled (sub-sampling/down-sampling). For example, each convolution module may further include a batch normalization (batch normalization) layer according to actual needs, so that the processing procedure of each level may further include batch normalization processing.
For example, the batch normalization layer is used for performing batch normalization processing on the feature map so as to change the gray value of the pixel of the feature image within a predetermined range, thereby reducing the calculation difficulty and improving the contrast. For example, the predetermined range may be [ -1, 1 ]. For example, the processing manner of the batch normalization layer may refer to a common batch normalization processing process, and is not described herein again.
Convolutional layers are the core layers of convolutional neural networks. In the convolutional layer of the convolutional neural network, one neuron is connected with only part of the neurons of the adjacent layer. The convolutional layer may apply several convolutional kernels (also called filters) to the input image to extract various types of features of the input image. Each convolution kernel may extract one type of feature. The convolution kernel is generally initialized in the form of a random decimal matrix, and the convolution kernel can be learned to obtain a reasonable weight in the training process of the convolutional neural network. The result obtained after applying a convolution kernel to the input image is called a feature image (feature map), and the number of feature images is equal to the number of convolution kernels. Each characteristic image is composed of a plurality of neurons arranged in a rectangular shape, and the neurons of the same characteristic image share a weight value, wherein the shared weight value is a convolution kernel. The feature images output by a convolutional layer of one level may be input to an adjacent convolutional layer of the next level and processed again to obtain new feature images. For example, as shown in fig. 2A, a first level of convolutional layers may output a first feature image, which is input to a second level of convolutional layers for further processing to obtain a second feature image.
For example, as shown in Fig. 2B, the convolutional layer may use different convolution kernels to convolve the data of a certain local receptive field of the input image, and the convolution result is input to the activation layer, which performs calculation according to the corresponding activation function to obtain the feature information of the input image.
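The following is a small illustrative sketch (not part of this disclosure) of that operation: a single convolution kernel is slid over the local receptive fields of an input image, and the result is passed through a ReLU activation. The image and kernel values are arbitrary examples.

```python
import numpy as np

def conv2d_single(image, kernel):
    # Each output value is the weighted sum of one local receptive field of the
    # input image (no padding, stride 1).
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0.0, 1.0, 0.0],
                   [1.0, -4.0, 1.0],
                   [0.0, 1.0, 0.0]])                          # one example convolution kernel
feature_map = np.maximum(conv2d_single(image, kernel), 0.0)   # activation layer (ReLU)
print(feature_map.shape)  # (3, 3): one feature image per convolution kernel
```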
For example, as shown in Figs. 2A and 2B, a downsampling layer is disposed between adjacent convolutional layers and is one form of downsampling. On the one hand, the downsampling layer can be used to reduce the scale of the input image, simplify the computational complexity, and reduce overfitting to a certain extent; on the other hand, the downsampling layer can perform feature compression to extract the main features of the input image. The downsampling layer can reduce the size of the feature images without changing their number. For example, if an input image of size 12 × 12 is downsampled with a 6 × 6 window, a 2 × 2 output image can be obtained, which means that 36 pixels of the input image are combined into 1 pixel of the output image. The last downsampling layer or convolutional layer may be connected to one or more fully-connected layers, which are used to connect all the extracted features. The output of the fully-connected layer is a one-dimensional matrix, i.e., a vector.
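A short sketch of the 12 × 12 → 2 × 2 example above, assuming average pooling with a non-overlapping 6 × 6 window (max pooling would behave analogously); this is an illustration only, not a definition from this disclosure.

```python
import numpy as np

def average_pool(image, window):
    # Non-overlapping pooling: every window x window patch becomes one output pixel.
    h, w = image.shape
    return image.reshape(h // window, window, w // window, window).mean(axis=(1, 3))

image = np.random.rand(12, 12)
pooled = average_pool(image, 6)
print(pooled.shape)  # (2, 2): 36 input pixels are combined into 1 output pixel
```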
Fig. 3 shows a schematic structural diagram of another convolutional neural network. For example, referring to the example shown in FIG. 3, the output of the last convolutional layer (i.e., the t-th convolutional layer) is input to a planarization layer for a planarization operation (Flatten). The planarization layer may convert the feature image (2D image) into a vector (1D). The planarization operation may be performed as follows:
v_k = f_{k/j, k%j}
where v is a vector containing k elements and f is a matrix with i rows and j columns.
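A minimal sketch of the planarization (flatten) operation defined by the formula above; the division k/j in the index is integer division, written k // j in Python.

```python
import numpy as np

def flatten(f):
    # v_k = f[k // j, k % j] for a feature image f with i rows and j columns.
    i, j = f.shape
    return np.array([f[k // j, k % j] for k in range(i * j)])

f = np.arange(6).reshape(2, 3)
print(flatten(f))                                  # [0 1 2 3 4 5]
print(np.array_equal(flatten(f), f.reshape(-1)))   # True: row-major flattening
```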
The output of the planarization layer (i.e., the 1D vector) is then input to a fully connected layer (FCN). The fully-connected layer may have the same structure as the convolutional neural network, but differs in that the fully-connected layer uses a different scalar value instead of the convolution kernel.
For example, the output of the last convolutional layer may also be input to an averaging layer (AVG). The averaging layer is used to average the output, i.e., to represent the output image by the mean of the feature image, so that a 2D feature image is converted into a scalar. For example, if a convolutional neural network includes an averaging layer, it may not include a planarization layer.
For example, according to actual needs, the averaging layer or the fully-connected layer may be connected to a classifier; the classifier may perform classification according to the extracted features, and the output of the classifier may be used as the final output of the convolutional neural network, i.e., a class identifier (label) representing the class of the image.
For example, the classifier may be a support vector machine (SVM) classifier, a softmax classifier, a nearest-neighbor (KNN) classifier, or the like. As shown in Fig. 3, in one example, the convolutional neural network includes a softmax classifier. The softmax classifier is a generalization of the logistic function that can compress a K-dimensional vector z containing arbitrary real numbers into a K-dimensional vector σ(z). The formula of the softmax classifier is as follows:

σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k), for j = 1, …, K,

wherein z_j represents the jth element of the K-dimensional vector z, σ(z) represents the prediction probability of each class identifier (label), each element of σ(z) is a real number in the range (0, 1), and the sum of the elements of the K-dimensional vector σ(z) is 1. According to the above formula, each class identifier in the K-dimensional vector z is assigned a certain prediction probability, and the class identifier having the largest prediction probability is selected as the identifier or class of the input image.
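A short sketch of the softmax formula above (subtracting the maximum is only a numerical-stability detail and does not change the result):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # shift by max(z) to avoid overflow; result unchanged
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
sigma = softmax(z)
print(sigma)                      # prediction probability of each class identifier
print(float(sigma.sum()))         # 1.0
print(int(np.argmax(sigma)))      # class identifier with the largest prediction probability
```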
Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings.
Fig. 4 is a flowchart of a method for compressing and accelerating a neural network model according to at least one embodiment of the present disclosure. For example, the compression and acceleration method can be used for quantifying various neural network models such as ResNet (e.g., ResNet-50), MobileNet-V1, MobileNet-V2, VGG-Net, and the like, so as to realize compression and acceleration of the various neural network models. It should be noted that the applicable scope of the compression and acceleration method includes, but is not limited to, the above listed neural network models.
For example, as shown in fig. 4, the compression and acceleration method includes steps S000 to S300.
Step S000: Pre-training the neural network model to obtain the preparation weight parameters of the neural network model.
For example, in step S000, the neural network model may be an untrained full-precision model. For example, the full-precision model may be pre-trained using conventional training methods, training techniques (tricks), and training parameter (e.g., including hyper-parameter) configurations.
For example, the training parameter configuration typically includes: an initial learning rate, a learning rate adjustment scheme (learning rate scheduler), weight decay, the number of iterations of the training set (epochs), an optimizer, and a batch size, among others. For example, in some examples, the initial learning rate may be set to 0.05, the learning rate adjustment scheme may employ a cosine annealing scheduler, the weight decay may be set to 4 × 10^-5, the number of iterations of the training set may be set to 150, the optimizer may employ a stochastic gradient descent (SGD) optimizer, the batch size may be set to 2048 or 1024, and so on. It should be noted that the above training parameter configuration is exemplary and should not be considered as limiting the present disclosure. In the embodiments of the present disclosure, the training parameter configuration may be set according to actual needs.
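Purely as an illustration, the exemplary configuration above could be collected in a single structure as follows; the field names are assumptions and not an API defined by this disclosure.

```python
# Exemplary training parameter configuration (values taken from the description above).
training_config = {
    "initial_learning_rate": 0.05,
    "lr_scheduler": "cosine_annealing",
    "weight_decay": 4e-5,
    "epochs": 150,          # number of iterations over the training set
    "optimizer": "SGD",     # stochastic gradient descent
    "batch_size": 2048,     # or 1024, depending on available hardware
}
```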
For example, the pre-training process of neural network models typically includes: initializing parameters of the neural network model; processing training input data by using a neural network model to obtain training output data; calculating a loss value through a loss function based on the training output data; gradients are calculated based on the loss values and parameters of the neural network model are modified.
For example, in some examples, a Kaiming initialization scheme may be employed to initialize the parameters of the neural network model. For example, the parameters of the neural network model may be initialized to random numbers that conform to a Gaussian distribution. For example, the initial weight parameters of each functional layer (e.g., convolutional layer, fully-connected layer, etc.) of the neural network model may be made to conform to a Gaussian distribution, e.g., the expectation of the Gaussian distribution is 0, and the standard deviation of the Gaussian distribution is the inverse of the number of output neurons of that functional layer. For example, for a convolutional layer, the number of output neurons of the convolutional layer is equal to the product of the number of output channels of the convolutional layer and the number of elements in the convolution kernel of the convolutional layer; for example, for a fully-connected layer, the number of output neurons of the fully-connected layer is equal to the number of features output by the fully-connected layer.
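A minimal sketch of such an initialization. Here the variance of the Gaussian distribution is assumed to be the reciprocal of the number of output neurons (one reading of the description above, consistent with the scale constraint used later for the scale transformation parameters); the layer shapes are arbitrary examples.

```python
import numpy as np

def init_layer_weights(n_out, shape, rng):
    # Zero-mean Gaussian initialization whose spread is tied to the number of
    # output neurons n_out; the variance 1 / n_out is an assumption (see above).
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_out), size=shape)

rng = np.random.default_rng(0)
# Convolutional layer: n_out = output channels x elements per convolution kernel.
conv_w = init_layer_weights(n_out=64 * 3 * 3, shape=(64, 32, 3, 3), rng=rng)
# Fully-connected layer: n_out = number of output features.
fc_w = init_layer_weights(n_out=1000, shape=(1000, 512), rng=rng)
print(conv_w.std(), fc_w.std())
```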
For example, in some examples, the type of training input data depends on the processing objectives of the neural network model, e.g., the training input data may include images, text, speech, etc. Taking neural network models such as ResNet, MobileNet-V1, MobileNet-V2, and VGG-Net as examples, the training input data may be images, and images in the ImageNet database may be used as the training input data.
For example, in some examples, the loss function may be selected according to actual needs, for example, the loss function may include, but is not limited to, one or any combination of a 0-1 loss function, a square loss function, a logarithmic loss function, a cross-entropy loss function (cross-entropy cost function), and the like, which is not limited by the embodiments of the disclosure.
For example, in some examples, a stochastic gradient descent (SGD) algorithm, a batch gradient descent (BGD) algorithm, or the like may be used to calculate the gradient and modify the parameters of the neural network model based on the gradient.
For example, in some examples, the pre-training process of the neural network model may further include: judging whether the training of the neural network model meets a preset condition or not, and if not, repeatedly training the neural network model; and if the preset conditions are met, stopping training the neural network model to obtain the trained neural network model. For example, in one example, the predetermined condition is that the loss value corresponding to the training input data is no longer significantly reduced; for example, in another example, the predetermined condition is that the number of times of training or the training period of the neural network model reaches a predetermined number; embodiments of the present disclosure are not limited in this regard.
It should be noted that the above description only schematically illustrates the training process of the neural network model. Those skilled in the art will appreciate that in the training process, a large amount of sample data is required to train the neural network model; meanwhile, in the training process of each sample data, a plurality of repeated iterations can be included to correct the parameters of the neural network model. As another example, the training phase may also include fine-tuning (fine-tune) parameters of the neural network model to obtain more optimal parameters.
For example, in some examples, the neural network model includes linear layers, e.g., the linear layers include at least one of convolutional layers (convolution layer), recursive layers (recursive layer), and fully-connected layers (full-connected layer). For example, in some examples, the neural network model also includes non-linear layers, e.g., the non-linear layers include a batch normalization layer (batch normalization layer) and an activation layer (activation layer, e.g., employing a non-linear activation function), and so on.
For example, after pre-training, the parameters of the neural network model are the preparatory weight parameters. For example, in some examples, the provisioning weight parameter is a full precision 32-bit floating point number. It should be noted that, in some examples, the compression and acceleration method provided by the embodiments of the present disclosure may not include step S000, for example, steps S100 to S300 may be performed directly based on a neural network model that is trained in the art to obtain a target quantization model. In this case, the parameters of the trained neural network model are the preparatory weight parameters.
Step S100: Quantizing the parameters of the neural network model to obtain a quantization model.
For example, in step S100, the parameters of the neural network model may be quantized using a DoReFa scheme. For example, quantizing the parameters of the neural network model refers to changing at least some parameters of the neural network model from, for example, high-precision floating point numbers (for example, full-precision 32-bit floating point numbers) to, for example, low-precision fixed point numbers (for example, 1-8-bit fixed point numbers), thereby compressing and accelerating the neural network model. It should be noted that, in step S100, other types of quantization schemes may also be used to quantize the parameters of the neural network model, and the embodiments of the present disclosure are not limited thereto. Hereinafter, the quantization processing in step S100 is explained in detail based on the DoReFa scheme. For example, specific details of the DoReFa scheme can be found in Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou, "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients". This document is hereby incorporated by reference in its entirety as part of the present disclosure.
Fig. 5 is an exemplary flowchart corresponding to step S100 shown in fig. 4 provided in at least one embodiment of the present disclosure. For example, as shown in fig. 5, the parameters of the neural network model are quantized to obtain a quantized model, i.e., step S100, which includes steps S110 to S120.
Step S110: Clamping the preparation weight parameters of the linear layer to obtain the clamping weight parameters of the linear layer.
For example, "clipping" refers to scaling a set of parameters (e.g., preparatory weight parameters of a linear layer) according to a certain rule (e.g., according to a certain formula), so that the value range of the scaled parameters is limited to a certain interval, so as to facilitate subsequent further processing. For example, in some examples, the preparation weight parameter of the linear layer may be clamped according to a clamping formula to limit a value range of the clamping weight parameter of the linear layer to a predetermined interval, for example, the predetermined interval may be [0,1], but is not limited thereto. For example, by the clamping process, the distribution of the parameters of the linear layer (i.e., the clamping weight parameters of the linear layer) in the predetermined interval can be made more uniform, thereby being beneficial to reducing quantization errors in subsequent steps. For example, in some examples, the clamp formula may be expressed as:
Wc_ij = tanh(W_ij)/(2 × max(|tanh(W_mn)|)) + 1/2,

wherein Wc represents the clamping weight matrix of the linear layer (including the clamping weight parameters of the linear layer), Wc_ij represents the parameter in the ith row and jth column of the clamping weight matrix, W represents the preparation weight matrix of the linear layer (including the preparation weight parameters of the linear layer), W_ij represents the parameter in the ith row and jth column of the preparation weight matrix of the linear layer, W_mn represents the parameter in the mth row and nth column of the preparation weight matrix of the linear layer (the maximum is taken over all m and n), tanh(·) represents the hyperbolic tangent function, and max(·) represents the maximum-value function.
For example, the above clamping formula can limit the value range of the clamping weight parameters of the linear layer to the interval [0, 1].
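A minimal illustrative sketch (not the reference implementation of this disclosure) of the clamping processing of step S110, using the clamping formula above:

```python
import numpy as np

def clamp_weights(W):
    # tanh squashes the preparation weights; dividing by twice the largest |tanh|
    # and adding 1/2 limits the clamping weights to the interval [0, 1].
    t = np.tanh(W)
    return t / (2.0 * np.max(np.abs(t))) + 0.5

W = np.random.randn(4, 8)                  # preparation weight matrix of a linear layer
Wc = clamp_weights(W)
print(Wc.min() >= 0.0, Wc.max() <= 1.0)    # True True
```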
Step S120: Carrying out quantization processing on the clamping weight parameters of the linear layer to obtain the quantization weight parameters of the linear layer.
For example, in some examples, the clamp weight parameters of the linear layer may be quantized according to a weight quantization formula to obtain the quantized weight parameters of the linear layer. For example, in some examples, the weight quantization formula may be expressed as:
Q_ij = 2 × round((2^b - 1) × Wc_ij)/(2^b - 1) - 1,

wherein Q represents the quantization weight matrix of the linear layer (including the quantization weight parameters of the linear layer), Q_ij represents the parameter in the ith row and jth column of the quantization weight matrix of the linear layer, Wc_ij represents the parameter in the ith row and jth column of the clamping weight matrix of the linear layer, b represents the number of bits of the quantization weight parameter of the linear layer, and round(·) represents a rounding function.
For example, the parameters of the quantization model include quantization weight parameters of the linear layer. For example, to facilitate the transfer of the quantization model to the mobile terminal and the embedded system, the bit number b of the quantization weight parameter of the linear layer is generally set to 1-8 bits (bit). Of course, the number of bits of the quantization weight parameter of the linear layer may also be set to more bits as needed, which is not limited by the embodiments of the present disclosure.
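A minimal illustrative sketch of the quantization processing of step S120, using the weight quantization formula above (b = 2 bits is an arbitrary example value):

```python
import numpy as np

def quantize_weights(Wc, b):
    # Map the clamping weights in [0, 1] onto 2**b evenly spaced levels and
    # rescale the result to the range [-1, 1].
    levels = 2 ** b - 1
    return 2.0 * np.round(levels * Wc) / levels - 1.0

Wc = np.random.rand(4, 8)        # clamping weight matrix (values in [0, 1])
Q = quantize_weights(Wc, b=2)    # 2-bit quantization weight matrix
print(np.unique(Q))              # at most 2**b distinct values, e.g. [-1, -1/3, 1/3, 1]
```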
Fig. 6 is another exemplary flowchart corresponding to step S100 shown in fig. 4 provided in at least one embodiment of the present disclosure. Step S100 shown in fig. 6 includes step S130 in addition to step S110 and step S120 shown in fig. 5.
For example, in some examples, the neural network model includes an activation layer. For example, the activation layer may include a PACT activation function, but is not limited to such. For example, the PACT activation function is expressed as:
y = PACT(x) = 0.5 × (|x| - |x - α| + α),

wherein y represents the output of the activation layer, x represents the input of the activation layer, and α represents the activation value parameter of the PACT activation function. For example, α is a floating-point number. For example, the PACT activation function can reduce the quantization error of the output of the activation layer.
For example, as shown in fig. 6, the parameters of the neural network model are quantized to obtain a quantized model, i.e., step S100, which further includes step S130.
Step S130: Carrying out quantization processing on the output of the activation layer.
For example, in some examples, the output of the active layer may be quantized according to an active value quantization formula. For example, the activation value quantization formula may be expressed as:
q = round(y × (2^a - 1)/α) × α/(2^a - 1),

where q represents the quantized value of the output of the activation layer, y represents the output of the activation layer, a represents the number of bits of the quantized value of the output of the activation layer, and round(·) represents a rounding function. For example, q is a dynamic fixed-point number; for example, the number of bits a of the quantized value of the output of the activation layer is generally set to, for example, 1 to 8 bits, for example, 2 to 4 bits.
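A minimal illustrative sketch of step S130, combining the PACT activation function and the activation value quantization formula above (α = 2.0 and a = 4 bits are arbitrary example values):

```python
import numpy as np

def pact(x, alpha):
    # PACT activation: equivalent to clipping x to the interval [0, alpha].
    return 0.5 * (np.abs(x) - np.abs(x - alpha) + alpha)

def quantize_activation(y, alpha, a):
    # a-bit quantization of the activation values (dynamic fixed-point levels).
    levels = 2 ** a - 1
    return np.round(y * levels / alpha) * alpha / levels

x = 3.0 * np.random.randn(8)
alpha = 2.0                               # activation value parameter
y = pact(x, alpha)
q = quantize_activation(y, alpha, a=4)
print(y.min() >= 0.0, y.max() <= alpha)   # True True
print(np.unique(q))                       # at most 2**a distinct quantized levels
```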
For example, in the embodiment of the present disclosure, the output of the active layer is quantized, which is beneficial to increasing the operation speed of the quantization model, so as to be beneficial to implementing the acceleration function of the compression and acceleration method provided by the embodiment of the present disclosure.
It should be noted that, in the embodiment of the present disclosure, the quantization process may not be performed on the batch normalization layer in the neural network model, or may not be performed on the bias (bias) of the last fully-connected layer in the neural network model.
In their research, the inventors of the present application found the following. On the one hand, the quantization model obtained according to step S100 generally suffers from degraded accuracy and degraded performance. On the other hand, in the neural network model or/and the quantization model, if the gradients of the weights are kept at the same order of magnitude, the problems of gradient explosion and gradient vanishing can be prevented, which helps improve the accuracy and performance of the quantization model. For example, in order to keep the gradients of the weights at the same order of magnitude, a batch normalization layer may be directly connected after a linear layer in the neural network model (the output of the linear layer is processed by the batch normalization layer and then input into a subsequent functional layer); however, the neural network model often also contains linear layers that are not directly followed by a batch normalization layer, for example, the last fully-connected layer used for output in neural network models such as ResNet, MobileNet-V1, MobileNet-V2, and VGG-Net. Therefore, the compression and acceleration method provided by the embodiments of the present disclosure further includes, after step S100, step S200 to further process the quantization model.
Step S200: and carrying out scale transformation processing on the quantization model to obtain a target quantization model.
For example, in some examples, under the same efficiency constraints, the target quantization model obtained in step S200 may have higher accuracy and better performance than the quantization model obtained in step S100. For example, the same efficiency constraints mean that the model size (corresponding to the storage space occupied by the model), power consumption, latency (corresponding to the processing speed of the model), and the like are substantially the same. For example, in some examples, the performance of the target quantization model obtained in step S200 may be comparable to or better than the performance of the corresponding full-precision model (see subsequent Tables 1-2).
Fig. 7 is an exemplary flowchart corresponding to step S200 shown in fig. 4 provided in at least one embodiment of the present disclosure. For example, as shown in fig. 7, the quantization model is subjected to a scaling process to obtain a target quantization model, i.e., step S200 includes steps S210 to S220.
Step S210: and calculating the scale transformation parameter of the linear layer based on the number of output neurons of the linear layer or the standard deviation of the preparation weight parameter of the linear layer.
For example, in some examples, calculating the scale transformation parameter of the linear layer based on the number of output neurons of the linear layer includes: calculating the scale transformation parameter of the linear layer according to a first scale transformation parameter calculation formula. For example, the first scale transformation parameter calculation formula is expressed as:

$\mathrm{RSF} = \frac{1}{\sqrt{n \cdot \mathrm{VAR}(Q)}}$

wherein RSF represents the scale transformation parameter of the linear layer, $n$ represents the number of output neurons of the linear layer, $Q$ represents the quantization weight matrix of the linear layer (including the quantization weight parameters of the linear layer), and VAR(Q) represents the variance of the elements of the quantization weight matrix of the linear layer.
For example, in some examples, when the number of bits of the quantization weight parameter of the linear layer is 1-2 bits, the scaling parameter RSF of the linear layer calculated by using the first scaling parameter calculation formula may cause the target quantization model to converge faster than the scaling parameter RSF of the linear layer calculated by using the subsequent two scaling parameter calculation formulas. It should be noted that, in the embodiment of the present disclosure, when the number of bits of the quantization weight parameter of the linear layer is other values (for example, 3-8 bits), the scaling parameter RSF of the linear layer may still be calculated by using the first scaling parameter calculation formula.
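For illustration, a minimal sketch of computing the scale transformation parameter RSF from the number of output neurons according to the first scale transformation parameter calculation formula above; the example matrix, its shape, and the level spacing are arbitrary assumptions.

```python
import numpy as np

def rescale_factor_from_neurons(q: np.ndarray, n_out: int) -> float:
    # RSF = 1 / sqrt(n_out * VAR(Q)): after rescaling, the standard
    # quantization weights have a variance of roughly 1 / n_out.
    return 1.0 / np.sqrt(n_out * np.var(q))

# Example with a quantized weight matrix of a layer having 8 output neurons.
rng = np.random.default_rng(0)
q = np.round(3 * rng.uniform(-1, 1, size=(8, 16))) / 3  # toy low-bit weight levels
rsf = rescale_factor_from_neurons(q, n_out=8)
print(np.isclose(np.var(rsf * q) * 8, 1.0))  # variance of rescaled weights is about 1/8
```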
For example, in other examples, calculating the scale transformation parameter of the linear layer based on the number of output neurons of the linear layer includes: calculating the scale transformation parameter of the linear layer according to a second scale transformation parameter calculation formula. For example, the second scale transformation parameter calculation formula is expressed as:

$\mathrm{RSF} = \frac{1}{\sqrt{n \cdot \mathrm{VAR}(\widehat{W})}}$

wherein RSF represents the scale transformation parameter of the linear layer, $n$ represents the number of output neurons of the linear layer, $\widehat{W}$ represents the auxiliary weight matrix of the linear layer, and VAR($\widehat{W}$) represents the variance of the elements of the auxiliary weight matrix of the linear layer. The auxiliary weight matrix $\widehat{W}$ of the linear layer is expressed as:

$\widehat{W} = 2\,\widetilde{W} - 1$

wherein $\widetilde{W}$ represents the clamp weight matrix of the linear layer.
It should be noted that, in the above example, the auxiliary weight matrix $\widehat{W}$ of the linear layer is introduced only for explaining the second scale transformation parameter calculation formula; the neural network model and its quantization model do not include the auxiliary weight matrix $\widehat{W}$ of the linear layer.
For example, in still other examples, calculating the scale transformation parameter of the linear layer based on the standard deviation of the preparation weight parameters of the linear layer includes: calculating the scale transformation parameter of the linear layer according to a third scale transformation parameter calculation formula. For example, the third scale transformation parameter calculation formula is expressed as:

$\mathrm{RSF} = \sqrt{\frac{\mathrm{VAR}(W)}{\mathrm{VAR}(\widehat{W})}}$

wherein RSF represents the scale transformation parameter of the linear layer, $W$ represents the preparation weight matrix of the linear layer, VAR(W) represents the variance of the elements of the preparation weight matrix of the linear layer, $\widehat{W}$ represents the auxiliary weight matrix of the linear layer, and VAR($\widehat{W}$) represents the variance of the elements of the auxiliary weight matrix of the linear layer. The auxiliary weight matrix $\widehat{W}$ of the linear layer is expressed as:

$\widehat{W} = 2\,\widetilde{W} - 1$

wherein $\widetilde{W}$ represents the clamp weight matrix of the linear layer.
It should be noted that, in the above example, the auxiliary weight matrix $\widehat{W}$ of the linear layer is introduced only for explaining the third scale transformation parameter calculation formula; the neural network model and its quantization model do not include the auxiliary weight matrix $\widehat{W}$ of the linear layer.
It should be noted that, in some examples, the accuracy and performance of the target quantization models obtained based on the scale transformation parameter RSF of the linear layer calculated with the first, the second, or the third scale transformation parameter calculation formula are substantially equivalent.
For example, in some examples, when the number of bits of the quantization weight parameter of the linear layer is 3 to 8, any one of the first scaling parameter calculation formula, the second scaling parameter calculation formula, and the third scaling parameter calculation formula may be selected to calculate the scaling parameter RSF of the linear layer, and meanwhile, the accuracy and the performance of the obtained target quantization model are substantially equivalent. It should be noted that, in at least one embodiment of the present disclosure, when the number of bits of the quantization weight parameter of the linear layer is other values (for example, 1-2 bits), the scaling parameter RSF of the linear layer may still be calculated by using the second scaling parameter calculation formula or the third scaling parameter calculation formula.
Step S220: and carrying out scale transformation processing on the quantization weight parameters of the linear layer based on the scale transformation parameters of the linear layer to obtain standard quantization weight parameters of the linear layer.
For example, in some examples, scaling the quantization weight parameters of a linear layer (e.g., a linear layer not directly followed by a batch normalization layer) based on the scale transformation parameter of that linear layer helps keep the gradients of the weights in the quantization model at the same order of magnitude, thereby helping improve the accuracy and performance of the quantization model.
For example, in some examples, the quantization weight parameters of the linear layer may be scaled according to a scale transformation formula. For example, the scale transformation formula may be expressed as:

$Q^{*}_{ij} = \mathrm{RSF} \cdot Q_{ij}$

wherein $Q^{*}$ represents the standard quantization weight matrix of the linear layer (including the standard quantization weight parameters of the linear layer), $Q^{*}_{ij}$ represents the parameter in the i-th row and j-th column of the standard quantization weight matrix of the linear layer, RSF represents the scale transformation parameter of the linear layer, $Q$ represents the quantization weight matrix of the linear layer, and $Q_{ij}$ represents the parameter in the i-th row and j-th column of the quantization weight matrix of the linear layer.
It should be noted that, in the embodiments of the present disclosure, the scale transformation processing may be applied only to the quantization weight parameters of the linear layers that are not directly followed by a batch normalization layer; that is, the quantization weight parameters of the linear layers that are directly followed by a batch normalization layer may be left unscaled. Of course, the quantization weight parameters of the linear layers not directly followed by a batch normalization layer and of the linear layers directly followed by a batch normalization layer may also both be subjected to the scale transformation processing. Embodiments of the present disclosure are not limited in this regard.
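For illustration, a minimal sketch of step S220 that applies the scale transformation only to linear layers not directly followed by a batch normalization layer; the layer representation used here is a hypothetical data structure and the first scale transformation parameter calculation formula is assumed.

```python
import numpy as np

def scale_linear_layers(layers):
    # `layers` is a hypothetical list of dicts describing the quantization model,
    # e.g. {"type": "fc", "Q": ndarray, "n_out": int, "followed_by_bn": bool}.
    for layer in layers:
        if layer["type"] not in ("conv", "recurrent", "fc"):
            continue                        # only linear layers are considered
        if layer.get("followed_by_bn", False):
            continue                        # batch normalization already controls the scale
        rsf = 1.0 / np.sqrt(layer["n_out"] * np.var(layer["Q"]))
        layer["RSF"] = rsf                  # scale transformation parameter
        layer["Q_star"] = rsf * layer["Q"]  # standard quantization weights
    return layers
```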
Step S300: and training the target quantization model by adopting the same training parameter configuration as the neural network model.
For example, in step S300, the training parameter configuration of the neural network model may refer to the relevant description in step S000, and will not be repeated herein.
Fig. 8 is an exemplary flowchart corresponding to step S300 shown in fig. 4 provided in at least one embodiment of the present disclosure. For example, as shown in fig. 8, training the target quantization model with the same training parameter configuration as the neural network model, i.e., step S300, includes a forward propagation stage, a back propagation stage, and a standard quantization stage, and these three stages are executed repeatedly to obtain a trained target quantization model. The forward propagation stage, the back propagation stage, and the standard quantization stage correspond to step S310, step S320, and step S330, respectively, described below.
Step S310: the training input data is processed using the current target quantization model to obtain training output data, and a loss value is calculated based on the training output data.
For example, for the operation of the forward propagation stage of the training process of the target quantization model, i.e., step S310, reference may be made accordingly to the operation of the forward propagation stage of the neural network model (e.g., the full-precision model), which will not be repeated herein.
Step S320: calculating a gradient based on the loss value, and correcting the parameters of the current neural network model based on the gradient to obtain an updated neural network model.
For example, for the operation of the back propagation stage of the training process of the target quantization model, i.e., step S320, reference may be made accordingly to the operation of the back propagation stage of the neural network model (e.g., the full-precision model), which will not be repeated herein.
For example, in some examples, in the case that the compression and acceleration method provided by the embodiments of the present disclosure further includes step S130 (i.e., quantizing the output of the activation layer), in step S320 an activation value gradient may be calculated according to an activation value gradient formula, and the current activation value parameter may be corrected based on the activation value gradient to obtain an updated activation value parameter. For example, in some examples, for the foregoing PACT activation function and activation value quantization formula, the activation value gradient formula may be expressed as:

$\frac{\partial q}{\partial \alpha} = \begin{cases} \dfrac{q - y}{\alpha}, & x < \alpha \\ 1, & x \geq \alpha \end{cases}$

wherein $q$ represents the quantized value of the output of the activation layer, $y$ represents the output of the activation layer, $x$ represents the input of the activation layer, and $\alpha$ represents the activation value parameter. For example, calculating the activation value gradient with this activation value gradient formula helps reduce the quantization error.
Step S330: quantizing the parameters of the updated neural network model to obtain an updated quantization model, and performing scale transformation on the updated quantization model to obtain an updated target quantization model.
For example, the operation of the standard quantization stage of the training process of the target quantization model, i.e., step S330, can refer to the related expressions of step S100 and step S200, and will not be repeated herein.
For example, by training the target quantization model in the above steps S310 to S330, the accuracy of the target quantization model can be improved, and the performance of the target quantization model can be improved.
It should be noted that, in the training process of the target quantization model, the parameters of the target quantization model (including the standard quantization weight parameters of the linear layer) are not directly updated, but the parameters of the neural network model are modified and then subjected to quantization and scale transformation, so as to update the parameters of the target quantization model.
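This note can be made concrete with a toy sketch of the three training stages on a single fully-connected layer: the forward propagation stage uses the scaled quantized weights, the back propagation stage corrects the full-precision preparation weights through a straight-through estimator, and the standard quantization stage re-derives the target quantization model at every iteration. All shapes, hyperparameters, and the loss function here are illustrative assumptions, not part of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))   # full-precision preparation weights (master copy)
x = rng.normal(size=(32, 8))             # training input data
t = rng.normal(size=(32, 4))             # training targets
lr, b = 0.1, 4                           # learning rate and weight bit number

def standard_quantize(w, b):
    th = np.tanh(w)
    wc = th / (2 * np.max(np.abs(th))) + 0.5                # clamping processing
    q = 2 * np.round((2 ** b - 1) * wc) / (2 ** b - 1) - 1  # quantization processing
    return q / np.sqrt(w.shape[0] * np.var(q))              # scale transformation processing

for step in range(100):
    Qs = standard_quantize(W, b)     # standard quantization stage
    y = x @ Qs.T                     # forward propagation with the target quantization model
    loss = np.mean((y - t) ** 2)     # loss value computed from the training output data
    grad_y = 2.0 * (y - t) / y.size  # back propagation stage
    grad_Q = grad_y.T @ x            # gradient w.r.t. the standard quantization weights
    W -= lr * grad_Q                 # straight-through estimator: correct the full-precision weights
```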
It should be noted that, compared with calculating the scale transformation parameter of the linear layer based on the standard deviation of the preparation weight parameters of the linear layer (i.e., using the third scale transformation parameter calculation formula), calculating the scale transformation parameter of the linear layer based on the number of output neurons of the linear layer (i.e., using the first or the second scale transformation parameter calculation formula) does not require computing VAR(W); this reduces the amount of computation and helps accelerate the training of the target quantization model.
It should be noted that, in some examples, the target quantization model may store the quantization weight parameters and the scale transformation parameter of the linear layer instead of the standard quantization weight parameters of the linear layer, so as to reduce the size (i.e., the occupied storage space) of the target quantization model. When the target quantization model is applied to data processing, the standard quantization weight parameters of the linear layer may be computed from the quantization weight parameters and the scale transformation parameter of the linear layer; alternatively, the input of the linear layer may first be processed with the quantization weight parameters of the linear layer to obtain the output of the linear layer, and the output of the linear layer may then be processed with the scale transformation parameter. For example, the target quantization model may accordingly store the bias of the linear layer (e.g., the fully-connected layer) in the quantization model rather than the bias of the linear layer in the target quantization model. In that case, when the target quantization model is applied to data processing, the bias of the linear layer in the quantization model may be converted into the bias of the linear layer in the target quantization model through the scale transformation parameter; alternatively, the input of the linear layer may be processed with the quantization weight parameters and the bias of the linear layer in the quantization model to obtain the output of the linear layer, and the output of the linear layer may then be processed with the scale transformation parameter. Embodiments of the present disclosure are not limited in this regard.
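As a sanity check for the two equivalent evaluation orders described above (reconstructing the standard quantization weights first, versus rescaling the output of the linear layer), a short sketch; the stored weight levels and the input are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(1)
Q = 2 * np.round(7 * rng.random((4, 8))) / 7 - 1   # stored quantization weights (3-bit levels)
RSF = 1.0 / np.sqrt(Q.shape[0] * np.var(Q))        # stored scale transformation parameter
x = rng.normal(size=(2, 8))                        # input of the linear layer

# Option 1: reconstruct the standard quantization weights, then compute the output.
out1 = x @ (RSF * Q).T
# Option 2: apply the stored quantization weights, then rescale the layer output.
out2 = RSF * (x @ Q.T)
print(np.allclose(out1, out2))  # True: the two evaluation orders are equivalent
```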
It should be noted that, in practical applications, the compression and acceleration method provided by the embodiments of the present disclosure may selectively (for example, either one of them or both of them) quantize the weight parameters of the neural network model (i.e., weight quantization) and the output of the activation layer (i.e., activation value quantization) according to practical needs.
It should be noted that, in the embodiment of the present disclosure, the neural network model and the quantization model thereof may be implemented by software, hardware, firmware, or any combination thereof, so as to execute the corresponding processing procedure.
It should be noted that, in the embodiments of the present disclosure, the flow of the compression and acceleration method of the neural network model may include more or fewer operations, and these operations may be performed sequentially or in parallel. Although the flow of the compression and acceleration method described above includes a plurality of operations occurring in a specific order, it should be clearly understood that the order of these operations is not limited. The compression and acceleration method described above may be performed once, or may be performed a plurality of times according to a predetermined condition.
According to the compression and acceleration method of the neural network model provided by the embodiments of the present disclosure, the target quantization model is obtained by performing scale transformation processing on the quantization model, which can improve the accuracy of the target quantization model and improve the performance of the target quantization model.
At least one embodiment of the present disclosure further provides a data processing method, where the data processing method includes: processing input data by using the target quantization model obtained with the compression and acceleration method provided by any embodiment of the present disclosure, so as to obtain output data.
For example, in some examples, the type of input data depends on the processing objective of the target quantization model; for example, the input data may include images, text, speech, and the like, depending on the processing objective of the target quantization model. Taking neural network models such as ResNet, MobileNet-V1, MobileNet-V2, and VGG-Net and their target quantization models as examples, the input data may be images.
For example, the output data may represent the result of the inference prediction made by the target quantization model on the input data. Taking neural network models such as ResNet, MobileNet-V1, MobileNet-V2, and VGG-Net and their target quantization models as examples, the output data may represent the classification results of the images (i.e., the input data).
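For illustration, a hypothetical sketch of how such output data could be turned into top-5 classification results; the model callable and its interface are assumptions used only for illustration and are not part of the embodiments.

```python
import numpy as np

def classify_top5(image: np.ndarray, target_quantization_model) -> list:
    # `target_quantization_model` stands for a deployed callable model; its
    # interface is assumed here only to show how output data (class scores)
    # is turned into classification results.
    logits = target_quantization_model(image)   # inference prediction on the input data
    top5 = np.argsort(logits)[::-1][:5]         # indices of the five best candidate classes
    return top5.tolist()                        # e.g. compared against the correct class for Acc.-5
```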
For example, in some examples, the target quantization model may be deployed in a mobile terminal and an embedded system such as a smart phone, a tablet computer, a car navigator, and the like, so that the mobile terminal and the embedded system and the like may perform the data processing method.
In the following, taking the MobileNet-V1 neural network model and the MobileNet-V2 neural network model as examples, the quantization scheme precision comparison at different bit widths is exemplarily shown by tables 1-2. Table 1 is a quantization scheme precision comparison table (quantizing weights and activation values) for MobileNet-V1 and MobileNet-V2 under different bit widths (i.e., the number of quantization bits); table 2 shows a comparison table of quantization scheme accuracies (quantization of weights, no quantization of activation values) for different bit widths of MobileNet-V1 and MobileNet-V2.
It should be noted that, in Tables 1-2, PACT (PArameterized Clipping acTivation), HAQ (Hardware-Aware Automated Quantization), and Deep Compression are known quantization schemes, and SAT is the quantization scheme (i.e., the compression and acceleration method) provided by the embodiments of the present disclosure, where the scale transformation parameters of the linear layer are calculated based on the number of output neurons of the linear layer. It should be noted that the bit width of the HAQ scheme is flexible, so the bit width of the HAQ scheme in Tables 1-2 is an equivalent bit width, for example 2, 3, 4, 5, 6, or 8, so that the accuracy can be compared with that of other quantization schemes at the corresponding bit width. In addition, in Tables 1-2, FP denotes the corresponding full-precision model; Acc.-1 denotes the probability that the single candidate class output by the model is the correct class of the input image, and Acc.-5 denotes the probability that the five candidate classes output by the model include the correct class of the input image. For example, specific details of the PACT scheme can be found in: Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv:1805.06085, 2018. Specific details of the HAQ scheme can be found in: Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. arXiv:1811.08886, 2019. Details of the Deep Compression scheme can be found in: Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149, 2015. The above documents are hereby incorporated by reference in their entirety as part of the present disclosure.
TABLE 1 quantization scheme precision comparison tables for different bit widths for MobileNet-V1 and MobileNet-V2 (quantizing weights and activation values)
TABLE 2 quantization scheme precision comparison tables for different bit widths for MobileNet-V1 and MobileNet-V2 (quantize weights, not activate values)
As can be seen from tables 1-2, the accuracy of the target quantization model obtained by using the compression and acceleration method provided by the embodiment of the present disclosure is in most cases higher than that of the quantization models obtained by using other known quantization schemes, which indicates that the compression and acceleration method provided by the embodiment of the present disclosure can improve the accuracy of the target quantization model and improve the performance of the target quantization model.
For technical effects of the data processing method provided by the embodiments of the present disclosure, reference may be made to the corresponding description of the compression and acceleration method of the neural network model in the above embodiments, and details are not repeated herein.
At least one embodiment of the present disclosure further provides a data processing apparatus. Fig. 9 is a schematic block diagram of a data processing apparatus according to at least one embodiment of the present disclosure.
For example, as shown in FIG. 9, the data processing apparatus 500 includes a memory 510 and a processor 520. For example, the memory 510 is used for non-transitory storage of computer readable instructions, and the processor 520 is used for executing the computer readable instructions, and the computer readable instructions are executed by the processor 520 to perform the compression and acceleration method of the neural network model or/and the data processing method provided by any embodiment of the disclosure.
For example, the memory 510 and the processor 520 may be in direct or indirect communication with each other. For example, in some examples, as shown in FIG. 9, the data processing apparatus 500 may further include a system bus 530, and the memory 510 and the processor 520 may communicate with each other via the system bus 530; for example, the processor 520 may access the memory 510 via the system bus 530. For example, in other examples, components such as the memory 510 and the processor 520 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things based on the Internet and/or a telecommunications network, and/or any combination thereof. The wired network may communicate by using, for example, twisted pair, coaxial cable, or optical fiber transmission, and the wireless network may communicate by using, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi. The present disclosure does not limit the type and function of the network.
For example, the processor 520 may control other components in the data processing apparatus to perform desired functions. The processor 520 may be a device having data processing capability and/or program execution capability, such as a Central Processing Unit (CPU), Tensor Processor (TPU), or Graphics Processor (GPU). The Central Processing Unit (CPU) may be an X86 or ARM architecture, etc. The GPU may be separately integrated directly onto the motherboard, or built into the north bridge chip of the motherboard. The GPU may also be built into the Central Processing Unit (CPU).
For example, memory 510 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory (or the like). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer instructions may be stored on memory 510 and executed by processor 520 to implement various functions. Various applications and various data, such as preparation weight parameters of the linear layer, standard quantization weight parameters of the linear layer, scaling parameters of the linear layer, activation value parameters, and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, some of the computer instructions stored by memory 510, when executed by processor 520, may perform one or more steps according to the compression and acceleration methods described above. As another example, other computer instructions stored by memory 510 may, when executed by processor 520, perform one or more steps in accordance with the data processing methods described above.
For example, as shown in fig. 9, the data processing apparatus 500 may further include an input interface 540 that allows an external device to communicate with the data processing apparatus 500. For example, the input interface 540 may be used to receive instructions from an external computer device, from a user, and the like. The data processing apparatus 500 may also include an output interface 550 that interconnects the data processing apparatus 500 and one or more external devices. For example, the data processing apparatus 500 may display an image or the like through the output interface 550. External devices that communicate with the data processing apparatus 500 through the input interface 540 and the output interface 550 may be included in an environment that provides any type of user interface with which a user can interact. Examples of user interface types include graphical user interfaces, natural user interfaces, and the like. For example, the graphical user interface may accept input from a user using input device(s) such as a keyboard, mouse, or remote control, and provide output on an output device such as a display. Furthermore, a natural user interface may enable a user to interact with the data processing apparatus 500 in a manner free from the constraints imposed by input devices such as a keyboard, mouse, and remote control. Instead, a natural user interface may rely on speech recognition, touch and stylus recognition, gesture recognition on and near the screen, air gestures, head and eye tracking, voice and speech, vision, touch, gestures, machine intelligence, and the like.
In addition, although illustrated as a single system in fig. 9, it is to be understood that the data processing apparatus 500 may also be a distributed system, and may also be arranged as a cloud infrastructure (including a public cloud or a private cloud). Thus, for example, several devices may communicate over a network connection and may collectively perform tasks described as being performed by the data processing apparatus 500.
For example, for the detailed description of the processing procedure of the compression and acceleration method, reference may be made to the related description in the embodiment of the compression and acceleration method, and for the detailed description of the processing procedure of the data processing method, reference may be made to the related description in the embodiment of the data processing method, and repeated parts are not repeated.
For example, in some examples, the data processing device may include, but is not limited to, a mobile terminal such as a smart phone, a tablet computer, a car navigator, and an embedded system.
It should be noted that the data processing apparatus provided in the embodiments of the present disclosure is illustrative and not restrictive, and the data processing apparatus may further include other conventional components or structures according to practical application needs, for example, in order to implement the necessary functions of the data processing apparatus, a person skilled in the art may set other conventional components or structures according to a specific application scenario, and the embodiments of the present disclosure are not limited thereto.
For technical effects of the data processing apparatus provided by the embodiments of the present disclosure, reference may be made to corresponding descriptions about the compression and acceleration method and the data processing method in the foregoing embodiments, and details are not repeated herein.
At least one embodiment of the present disclosure also provides a storage medium. Fig. 10 is a schematic diagram of a storage medium according to an embodiment of the disclosure. For example, as shown in fig. 10, the storage medium 600 non-transitory stores computer readable instructions 601, and when the non-transitory computer readable instructions 601 are executed by a computer (including a processor), the instructions of the compression and acceleration method provided by any embodiment of the disclosure may be executed or the instructions of the data processing method provided by any embodiment of the disclosure may be executed.
For example, one or more computer instructions may be stored on the storage medium 600. Some of the computer instructions stored on the storage medium 600 may be, for example, instructions for implementing one or more steps of the compression and acceleration methods described above. Further computer instructions stored on the storage medium may be, for example, instructions for carrying out one or more steps of the above-described data processing method.
For example, the storage medium may include a storage component of a tablet computer, a hard disk of a personal computer, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a compact disc read only memory (CD-ROM), a flash memory, or any combination of the above storage media, as well as other suitable storage media.
For technical effects of the storage medium provided by the embodiments of the present disclosure, reference may be made to corresponding descriptions about a compression and acceleration method and a data processing method in the foregoing embodiments, and details are not repeated herein.
For the present disclosure, there are the following points to be explained:
(1) in the drawings of the embodiments of the present disclosure, only the structures related to the embodiments of the present disclosure are referred to, and other structures may refer to general designs.
(2) Features of the disclosure in the same embodiment and in different embodiments may be combined with each other without conflict.
The above is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (25)
1. A compression and acceleration method of a neural network model, the neural network model including a linear layer, parameters of the neural network model including preparatory weight parameters, the compression and acceleration method comprising:
quantizing the parameters of the neural network model to obtain a quantization model, wherein the parameters of the quantization model comprise quantization weight parameters of the linear layer; and
carrying out scale transformation processing on the quantization model to obtain a target quantization model;
wherein performing the scaling process on the quantization model comprises:
calculating a scale transformation parameter of the linear layer based on the number of output neurons of the linear layer or a standard deviation of preparation weight parameters of the linear layer; and
and based on the scale transformation parameters of the linear layer, carrying out the scale transformation processing on the quantization weight parameters of the linear layer to obtain standard quantization weight parameters of the linear layer.
2. The compression and acceleration method of claim 1, wherein the linear layer comprises at least one selected from the group consisting of a convolutional layer, a recursive layer, and a fully-connected layer.
3. A compression and acceleration method according to claim 1 or 2, wherein the linear layer is not directly followed by a batch normalization layer.
4. A compression and acceleration method according to any of the claims 1-3, wherein quantizing parameters of the neural network model to obtain the quantized model comprises:
clamping the preparation weight parameter of the linear layer to obtain a clamping weight parameter of the linear layer; and
and carrying out quantization processing on the clamping weight parameters of the linear layer to obtain the quantization weight parameters of the linear layer.
5. The compression and acceleration method of claim 4, wherein calculating the scaling parameters of the linear layer based on the number of output neurons of the linear layer comprises:
calculating the scale transformation parameters of the linear layer according to a first scale transformation parameter calculation formula, wherein the first scale transformation parameter calculation formula is expressed as:
6. The compression and acceleration method of claim 5, wherein the number of bits of the quantization weight parameter of the linear layer is 1-8.
7. The compression and acceleration method of claim 6, wherein the number of bits of the quantization weight parameter of the linear layer is 1-2.
8. The compression and acceleration method of claim 4, wherein calculating the scaling parameters of the linear layer based on the number of output neurons of the linear layer comprises:
calculating the scale transformation parameters of the linear layer according to a second scale transformation parameter calculation formula, wherein the second scale transformation parameter calculation formula is expressed as:
wherein RSF represents a scale transformation parameter of the linear layer, $n$ represents the number of output neurons of the linear layer, $\widehat{W}$ represents an auxiliary weight matrix of the linear layer, and VAR($\widehat{W}$) represents a variance of the elements of the auxiliary weight matrix of the linear layer;
wherein $\widetilde{W}$ represents a clamp weight matrix of the linear layer.
9. The compression and acceleration method of claim 4, wherein calculating the scaling parameters of the linear layer based on the standard deviation of the preparation weight parameters of the linear layer comprises:
calculating the scale transformation parameters of the linear layer according to a third scale transformation parameter calculation formula, wherein the third scale transformation parameter calculation formula is expressed as:
wherein RSF represents a scale transformation parameter of the linear layer, $W$ represents a preparation weight matrix of the linear layer, VAR(W) represents a variance of the elements of the preparation weight matrix of the linear layer, $\widehat{W}$ represents an auxiliary weight matrix of the linear layer, and VAR($\widehat{W}$) represents a variance of the elements of the auxiliary weight matrix of the linear layer;
10. The compression and acceleration method according to claim 8 or 9, wherein the number of bits of the quantization weight parameter of the linear layer is 1-8.
11. The compression and acceleration method of claim 10, wherein the number of bits of the quantization weight parameter of the linear layer is 3-8.
12. The compression and acceleration method according to any one of claims 5-11, wherein the scaling the quantization weight parameters of the linear layer based on the scaling parameters of the linear layer to obtain the standard quantization weight parameters of the linear layer comprises:
and carrying out the scale transformation processing on the quantization weight parameters of the linear layer according to a scale transformation formula, wherein the scale transformation formula is expressed as follows:
wherein $Q^{*}$ represents a standard quantization weight matrix of the linear layer, $Q^{*}_{ij}$ represents the parameter in the i-th row and j-th column of the standard quantization weight matrix of the linear layer, $Q$ represents the quantization weight matrix of the linear layer, and $Q_{ij}$ represents the parameter in the i-th row and j-th column of the quantization weight matrix of the linear layer.
13. The compression and acceleration method according to any of the claims 4-12, wherein the clipping the preparation weight parameters of the linear layer to obtain the clipping weight parameters of the linear layer comprises:
performing the clamping processing on the preparation weight parameter of the linear layer according to a clamping formula, wherein the clamping formula is expressed as:
wherein $\widetilde{W}$ represents a clamping weight matrix of the linear layer, $\widetilde{W}_{ij}$ represents the parameter in the i-th row and j-th column of the clamping weight matrix, $W$ represents the preparation weight matrix of the linear layer, $W_{ij}$ represents the parameter in the i-th row and j-th column of the preparation weight matrix of the linear layer, $W_{mn}$ represents the parameter in the m-th row and n-th column of the preparation weight matrix of the linear layer, tanh(·) represents a hyperbolic tangent function, and max(·) represents a maximum-value function.
14. The compression and acceleration method of claim 13, wherein the performing the quantization process on the clamped weight parameters of the linear layer to obtain the quantized weight parameters of the linear layer comprises:
and carrying out the quantization processing on the clamp weight parameter of the linear layer according to a weight quantization formula, wherein the weight quantization formula is expressed as:
wherein $Q$ represents a quantization weight matrix of the linear layer, $Q_{ij}$ represents the parameter in the i-th row and j-th column of the quantization weight matrix of the linear layer, $b$ represents the number of bits of the quantization weight parameter of the linear layer, and round(·) represents a rounding function.
15. The compression and acceleration method of any of claims 4-14, further comprising:
and training the target quantization model by adopting the same training parameter configuration as the neural network model.
16. The compression and acceleration method of claim 15, wherein the training process of the target quantization model comprises: a forward propagation stage, a backward propagation stage and a standard quantization stage;
the forward propagation phase comprises: processing training input data by using a current target quantization model to obtain training output data, and calculating a loss value based on the training output data;
the back propagation phase comprises: calculating a gradient based on the loss value, and correcting parameters of the current neural network model based on the gradient to obtain an updated neural network model;
the standard quantization stage comprises: quantizing parameters of the updated neural network model to obtain an updated quantization model, and performing scale transformation processing on the updated quantization model to obtain an updated target quantization model.
17. The compression and acceleration method of claim 16 wherein the neural network model includes an activation layer that includes a PACT activation function represented as:
wherein $y$ represents the output of the activation layer, $x$ represents the input of the activation layer, and α represents the activation value parameter of the PACT activation function;
quantifying parameters of the neural network model to obtain the quantified model, further comprising:
performing the quantization process on the output of the active layer according to an active value quantization formula, the active value quantization formula being represented as:
where q represents a quantized value of the output of the active layer, a represents the number of bits of the quantized value of the output of the active layer, and round (·) represents a rounding function.
18. The compression and acceleration method of claim 17, wherein the back propagation phase further comprises:
calculating an activation value gradient according to an activation value gradient formula, correcting a current activation value parameter based on the activation value gradient to obtain an updated activation value parameter,
the activation value gradient formula is expressed as:
19. The compression and acceleration method according to any of the claims 15-18, wherein the training parameter configuration comprises: initial learning rate, learning rate adjustment scheme, weight attenuation, iteration times of a training set, optimizer and batch size.
20. The compression and acceleration method according to any of the claims 1-19, wherein prior to quantizing the parameters of the neural network model, the compression and acceleration method further comprises:
and pre-training the neural network model to obtain a preparation weight parameter of the neural network model.
21. The compression and acceleration method of claim 20, wherein the pre-training of the neural network model comprises:
parameters of the neural network model are initialized using an happy-inch initialization scheme.
22. The compression and acceleration method of any one of claims 1-21, wherein the neural network model includes one of ResNet, MobileNet-V1, MobileNet-V2, and VGG-Net.
23. A method of data processing, comprising:
processing input data using the target quantization model obtained by the compression and acceleration method of any one of claims 1 to 22.
24. A data processing apparatus comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing computer readable instructions;
wherein the computer readable instructions, when executed by the processor, perform the compression and acceleration method of any one of claims 1-22 or perform the data processing method of claim 23.
25. A storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, may perform instructions of the compression and acceleration method according to any one of claims 1-22 or may perform instructions of the data processing method according to claim 23.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910893276.XA CN110659725B (en) | 2019-09-20 | 2019-09-20 | Neural network model compression and acceleration method, data processing method and device |
PCT/IB2019/059565 WO2021053381A1 (en) | 2019-09-20 | 2019-11-07 | Compression and acceleration method for neural network model, and data processing method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910893276.XA CN110659725B (en) | 2019-09-20 | 2019-09-20 | Neural network model compression and acceleration method, data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110659725A true CN110659725A (en) | 2020-01-07 |
CN110659725B CN110659725B (en) | 2023-03-31 |
Family
ID=69038294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910893276.XA Active CN110659725B (en) | 2019-09-20 | 2019-09-20 | Neural network model compression and acceleration method, data processing method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110659725B (en) |
WO (1) | WO2021053381A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783976A (en) * | 2020-04-21 | 2020-10-16 | 北京大学 | Neural network training process intermediate value storage compression method and device based on window gradient updating |
CN111967608A (en) * | 2020-08-06 | 2020-11-20 | 北京灵汐科技有限公司 | Data processing method, device, equipment and storage medium |
CN111967583A (en) * | 2020-08-13 | 2020-11-20 | 北京嘀嘀无限科技发展有限公司 | Method, apparatus, device and medium for compressing neural network |
CN112085195A (en) * | 2020-09-04 | 2020-12-15 | 西北工业大学 | X-ADMM-based deep learning model environment self-adaption method |
CN112598020A (en) * | 2020-11-24 | 2021-04-02 | 深兰人工智能(深圳)有限公司 | Target identification method and system |
CN113222098A (en) * | 2020-01-21 | 2021-08-06 | 上海商汤智能科技有限公司 | Data processing method and related product |
CN113469324A (en) * | 2021-03-23 | 2021-10-01 | 中科创达软件股份有限公司 | Model dynamic quantization method and device, electronic equipment and computer readable medium |
CN113537340A (en) * | 2021-07-14 | 2021-10-22 | 深圳思悦创新有限公司 | Yolo target detection model compression method, system and storage medium |
WO2023020456A1 (en) * | 2021-08-16 | 2023-02-23 | 北京百度网讯科技有限公司 | Network model quantification method and apparatus, device, and storage medium |
CN117391175A (en) * | 2023-11-30 | 2024-01-12 | 中科南京智能技术研究院 | Pulse neural network quantification method and system for brain-like computing platform |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11687764B2 (en) * | 2020-04-17 | 2023-06-27 | Samsung Electronics Co., Ltd. | System and method for increasing utilization of dot-product based neural network accelerator |
CN113554147A (en) * | 2021-04-27 | 2021-10-26 | 北京小米移动软件有限公司 | Sample feature processing method and device, electronic equipment and storage medium |
CN113920720A (en) * | 2021-09-17 | 2022-01-11 | 上海吞山智能科技有限公司 | Highway tunnel equipment fault processing method and device and electronic equipment |
WO2024060002A1 (en) * | 2022-09-20 | 2024-03-28 | 华为技术有限公司 | Communication method and related device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170286830A1 (en) * | 2016-04-04 | 2017-10-05 | Technion Research & Development Foundation Limited | Quantized neural network training and inference |
CN107480770A (en) * | 2017-07-27 | 2017-12-15 | 中国科学院自动化研究所 | The adjustable neutral net for quantifying bit wide quantifies the method and device with compression |
CN108334945A (en) * | 2018-01-30 | 2018-07-27 | 中国科学院自动化研究所 | The acceleration of deep neural network and compression method and device |
US20190114511A1 (en) * | 2017-10-16 | 2019-04-18 | Illumina, Inc. | Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks |
CN109840589A (en) * | 2019-01-25 | 2019-06-04 | 深兰人工智能芯片研究院(江苏)有限公司 | A kind of method, apparatus and system running convolutional neural networks on FPGA |
US20190171935A1 (en) * | 2017-12-04 | 2019-06-06 | International Business Machines Corporation | Robust gradient weight compression schemes for deep learning applications |
CN110096647A (en) * | 2019-05-10 | 2019-08-06 | 腾讯科技(深圳)有限公司 | Optimize method, apparatus, electronic equipment and the computer storage medium of quantitative model |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10373050B2 (en) * | 2015-05-08 | 2019-08-06 | Qualcomm Incorporated | Fixed point neural network based on floating point neural network quantization |
WO2017031630A1 (en) * | 2015-08-21 | 2017-03-02 | 中国科学院自动化研究所 | Deep convolutional neural network acceleration and compression method based on parameter quantification |
-
2019
- 2019-09-20 CN CN201910893276.XA patent/CN110659725B/en active Active
- 2019-11-07 WO PCT/IB2019/059565 patent/WO2021053381A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170286830A1 (en) * | 2016-04-04 | 2017-10-05 | Technion Research & Development Foundation Limited | Quantized neural network training and inference |
CN107480770A (en) * | 2017-07-27 | 2017-12-15 | 中国科学院自动化研究所 | The adjustable neutral net for quantifying bit wide quantifies the method and device with compression |
US20190114511A1 (en) * | 2017-10-16 | 2019-04-18 | Illumina, Inc. | Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks |
US20190171935A1 (en) * | 2017-12-04 | 2019-06-06 | International Business Machines Corporation | Robust gradient weight compression schemes for deep learning applications |
CN108334945A (en) * | 2018-01-30 | 2018-07-27 | 中国科学院自动化研究所 | The acceleration of deep neural network and compression method and device |
CN109840589A (en) * | 2019-01-25 | 2019-06-04 | 深兰人工智能芯片研究院(江苏)有限公司 | A kind of method, apparatus and system running convolutional neural networks on FPGA |
CN110096647A (en) * | 2019-05-10 | 2019-08-06 | 腾讯科技(深圳)有限公司 | Optimize method, apparatus, electronic equipment and the computer storage medium of quantitative model |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113222098A (en) * | 2020-01-21 | 2021-08-06 | 上海商汤智能科技有限公司 | Data processing method and related product |
CN111783976A (en) * | 2020-04-21 | 2020-10-16 | 北京大学 | Neural network training process intermediate value storage compression method and device based on window gradient updating |
CN111967608A (en) * | 2020-08-06 | 2020-11-20 | 北京灵汐科技有限公司 | Data processing method, device, equipment and storage medium |
WO2022028577A1 (en) * | 2020-08-06 | 2022-02-10 | 北京灵汐科技有限公司 | Processing mode determining method, and data processing method |
CN111967583A (en) * | 2020-08-13 | 2020-11-20 | 北京嘀嘀无限科技发展有限公司 | Method, apparatus, device and medium for compressing neural network |
CN112085195A (en) * | 2020-09-04 | 2020-12-15 | 西北工业大学 | X-ADMM-based deep learning model environment self-adaption method |
CN112598020A (en) * | 2020-11-24 | 2021-04-02 | 深兰人工智能(深圳)有限公司 | Target identification method and system |
CN113469324A (en) * | 2021-03-23 | 2021-10-01 | 中科创达软件股份有限公司 | Model dynamic quantization method and device, electronic equipment and computer readable medium |
CN113469324B (en) * | 2021-03-23 | 2024-03-22 | 中科创达软件股份有限公司 | Model dynamic quantization method, device, electronic equipment and computer readable medium |
CN113537340A (en) * | 2021-07-14 | 2021-10-22 | 深圳思悦创新有限公司 | Yolo target detection model compression method, system and storage medium |
WO2023020456A1 (en) * | 2021-08-16 | 2023-02-23 | 北京百度网讯科技有限公司 | Network model quantification method and apparatus, device, and storage medium |
CN117391175A (en) * | 2023-11-30 | 2024-01-12 | 中科南京智能技术研究院 | Pulse neural network quantification method and system for brain-like computing platform |
Also Published As
Publication number | Publication date |
---|---|
WO2021053381A1 (en) | 2021-03-25 |
CN110659725B (en) | 2023-03-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110659725B (en) | Neural network model compression and acceleration method, data processing method and device | |
CN110852439B (en) | Data processing method and device and storage medium | |
US11875268B2 (en) | Object recognition with reduced neural network weight precision | |
US12008461B2 (en) | Method for determining neuron events based on cluster activations and apparatus performing same method | |
US11481613B2 (en) | Execution method, execution device, learning method, learning device, and recording medium for deep neural network | |
CN107622303B (en) | Method for neural network and device for performing the method | |
US11562247B2 (en) | Neural network activation compression with non-uniform mantissas | |
WO2020167480A1 (en) | Adjusting activation compression for neural network training | |
EP3877913A1 (en) | Training neural network accelerators using mixed precision data formats | |
CN111767979A (en) | Neural network training method, image processing method, and image processing apparatus | |
CN111095302A (en) | Compression of sparse deep convolutional network weights | |
WO2020142192A1 (en) | Neural network activation compression with narrow block floating-point | |
WO2022228425A1 (en) | Model training method and apparatus | |
CN115129386A (en) | Efficient optimization for neural network deployment and execution | |
CN113128478A (en) | Model training method, pedestrian analysis method, device, equipment and storage medium | |
Mamatkulovich | Lightweight residual layers based convolutional neural networks for traffic sign recognition | |
CN114266897A (en) | Method and device for predicting pox types, electronic equipment and storage medium | |
CN116863194A (en) | Foot ulcer image classification method, system, equipment and medium | |
CN115860100A (en) | Neural network model training method and device and computing equipment | |
WO2024060839A1 (en) | Object operation method and apparatus, computer device, and computer storage medium | |
CN114298289A (en) | Data processing method, data processing equipment and storage medium | |
CN116958728A (en) | Method and memory device for training neural network for image recognition | |
Ososkov et al. | Two-stage approach to image classification by deep neural networks | |
US20230410496A1 (en) | Omni-scale convolution for convolutional neural networks | |
CN116758618B (en) | Image recognition method, training device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |