CN113887706A

CN113887706A - Method and device for low bit quantization aiming at one-stage target detection network

Info

Publication number: CN113887706A
Application number: CN202111163481.4A
Authority: CN
Inventors: 王曦辉
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2022-01-04
Anticipated expiration: 2041-09-30
Also published as: CN113887706B

Abstract

The present invention provides a method, system, device and storage medium for low bit quantization for a one-stage target detection network, the method comprising: inserting pseudo-quantization nodes in all convolutional layers or fully connected layers of a backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers; replacing a relu layer in the backbone network with a satRelu layer; adding a second weight pseudo-quantization layer to a weight portion of a convolutional layer of a detection portion of the target detection network, and adding a second input pseudo-quantization layer to an input portion of the convolutional layer of the detection portion; and adding a satFun layer after the convolutional layer of the detection portion. The invention can keep the accuracy loss of the model below 1 percent, greatly improve the inference speed of the model, and reduce the storage space occupied by the model.

Description

Method and device for low bit quantization aiming at one-stage target detection network

Technical Field

The present invention relates to the field of object detection, and more particularly, to a method, system, apparatus, and storage medium for low bit quantization for a one-stage object detection network.

Background

Neural Network models generally occupy a large disk space, for example, the model file of AlexNet exceeds 200 MB. The models contain millions of parameters and most of the space is used to store the parameters of the models. These parameters are of the floating-point type and it is difficult for common compression algorithms to compress their space.

A typical model computes internally in floating point, which consumes relatively large computing resources (memory and CPU/GPU time). If the model could compute with simpler numeric types without affecting its accuracy, computation would be much faster and resource consumption much lower, which matters especially on mobile devices. Quantization techniques were introduced for this reason.

Quantization compresses the original network by reducing the number of bits required to represent each weight. With 8-bit quantization, a model can be compressed to 1/4 of its size and the running speed of the network can be greatly increased. Compared with 8-bit quantization, 4-bit quantization halves the model size again and improves the running speed by a further 50%. However, since 4 bits can represent at most 16 distinct values, the classification accuracy of the model is reduced. In addition, CPUs have no int4 data type, which causes practical difficulties in operation.
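The size ratios quoted above follow directly from the bit widths. The short sketch below works through the arithmetic, assuming a 32-bit floating-point baseline; the parameter count is a hypothetical value chosen only for illustration and is not taken from this document.

```python
def weight_storage_mb(num_params: int, bits_per_weight: int) -> float:
    """Megabytes (10**6 bytes) needed to store the weights alone."""
    return num_params * bits_per_weight / 8 / 1e6


def representable_values(bits: int) -> int:
    """Number of distinct codes an integer of the given width can hold."""
    return 2 ** bits


# Hypothetical parameter count, used only to make the arithmetic concrete.
NUM_PARAMS = 60_000_000

fp32_mb = weight_storage_mb(NUM_PARAMS, 32)  # assumed float32 baseline
int8_mb = weight_storage_mb(NUM_PARAMS, 8)
int4_mb = weight_storage_mb(NUM_PARAMS, 4)

print(f"fp32: {fp32_mb:.0f} MB, int8: {int8_mb:.0f} MB, int4: {int4_mb:.0f} MB")
print(f"distinct 4-bit codes: {representable_values(4)}")
```

Under these assumptions the 8-bit model takes a quarter of the float32 storage and the 4-bit model half of that again, while a 4-bit code can take only 16 distinct values.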

Unlike offline quantization, training quantization simulates the effects of the quantization operations during training, so that the model learns and adapts to the errors caused by quantization, thereby improving the accuracy of the quantized model. Training quantization is therefore also referred to as Quantization-Aware Training (QAT), meaning that training is aware that the model will be converted into a quantized model.

The model quantization can solve the problems of large parameter quantity, large calculation quantity, large memory occupation and the like of the conventional convolutional neural network, and has the potential advantages of compressing parameters, improving the speed, reducing the memory occupation and the like for the neural network.

At present, most training and inference frameworks support int8 quantization, but int4 quantization is not supported because it causes a serious drop in model accuracy, among other problems. Many low-bit quantization algorithms adopt nonlinear quantization, which can improve model accuracy but introduces additional operations and reduces the running speed.

At present, quantization methods for one-stage target detection networks generally quantize only the backbone network part of the model in order to reduce accuracy loss, and leave the detection network unquantized. Even though existing quantization methods can achieve little accuracy degradation and good acceleration in image classification tasks, applying low bit quantization to object detection while maintaining accuracy remains a challenge.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, a system, a computer device, and a computer readable storage medium for low bit quantization of a one-stage target detection network, in which pseudo-quantization layers and a satRelu layer are added to the backbone network and the detection portion of a target detection network, respectively, so that the compression rate of the model is improved by low bit quantization with only a small loss of model accuracy, the inference speed of the model is greatly improved, and the storage space occupied by the model is reduced.

In view of the above, an aspect of the embodiments of the present invention provides a method for low bit quantization for a one-stage target detection network, including the following steps: inserting pseudo-quantization nodes in all convolutional layers or fully connected layers of a backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers; replacing a relu layer in the backbone network with a satRelu layer; adding a second weight pseudo-quantization layer to a weight portion of a convolutional layer of a detection portion of the target detection network, and adding a second input pseudo-quantization layer to an input portion of the convolutional layer of the detection portion; and adding a satFun layer after the convolutional layer of the detection portion.

In some embodiments, said adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers comprises: limiting the input of the first weight pseudo-quantization layer to between -8 and 7 and limiting the input of the first input pseudo-quantization layer to between 0 and 15.

In some embodiments, said adding a second weight pseudo-quantization layer to a weight portion of a convolutional layer of a detection portion of the target detection network and adding a second input pseudo-quantization layer to an input portion of the convolutional layer of the detection portion comprises: limiting the input of the second weight pseudo-quantization layer to between -128 and 127 and limiting the input of the second input pseudo-quantization layer to between -128 and 127.

In some embodiments, the method further comprises: training the target detection network using a combination of the Mosaic and MixUp data augmentation algorithms.

In another aspect of the embodiments of the present invention, a system for low bit quantization for a one-stage target detection network is provided, including: a first pseudo-quantization module configured to insert pseudo-quantization nodes into all convolutional layers or fully connected layers of a backbone network of a target detection network, add a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and add a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers; a first precision module configured to replace a relu layer in the backbone network with a satRelu layer; a second pseudo-quantization module configured to add a second weight pseudo-quantization layer to a weight portion of a convolutional layer of a detection portion of the target detection network and to add a second input pseudo-quantization layer to an input portion of the convolutional layer of the detection portion; and a second precision module configured to add a satFun layer after the convolutional layer of the detection portion.

In some embodiments, the first pseudo-quantization module is configured to: limit the input of the first weight pseudo-quantization layer to between -8 and 7 and limit the input of the first input pseudo-quantization layer to between 0 and 15.

In some embodiments, the second pseudo-quantization module is configured to: limit the input of the second weight pseudo-quantization layer to between -128 and 127 and limit the input of the second input pseudo-quantization layer to between -128 and 127.

In some embodiments, the system further comprises a training module configured to: train the target detection network using a combination of the Mosaic and MixUp data augmentation algorithms.

In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method as above.

In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the above method steps.

The invention has the following beneficial technical effects: by adding pseudo-quantization layers and introducing a satRelu layer to the backbone network and the detection portion of a target detection network, respectively, the compression rate of the model is improved through low bit quantization with only a small loss of model accuracy, the inference speed of the model is greatly improved, and the storage space occupied by the model is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

Fig. 1 is a schematic diagram of an embodiment of a method for low bit quantization for a one-stage target detection network according to the present invention;

FIG. 2 is a schematic diagram of an embodiment of a system for low bit quantization for a one-stage target detection network provided by the present invention;

FIG. 3 is a schematic hardware diagram of an embodiment of a low bit quantization computer apparatus for a one-stage target detection network according to the present invention;

FIG. 4 is a schematic diagram of an embodiment of a computer storage medium for low bit quantization for a one-stage target detection network provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not identical. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In a first aspect of embodiments of the present invention, embodiments of a method for low bit quantization for a one-stage target detection network are presented. Fig. 1 is a schematic diagram illustrating an embodiment of a method for low bit quantization for a one-stage target detection network according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:

S1, inserting pseudo-quantization nodes in all convolutional layers or fully connected layers of a backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers;

S2, replacing a relu layer in the backbone network with a satRelu layer;

S3, adding a second weight pseudo-quantization layer to the weight portion of the convolutional layer of the detection portion of the target detection network, and adding a second input pseudo-quantization layer to the input portion of the convolutional layer of the detection portion; and

S4, adding a satFun layer after the convolutional layer of the detection portion.

A one-stage object detection network (SSD, YOLO, RetinaNet) mainly comprises two parts, a backbone network and a detection network. The backbone network is built from a common classification network, such as ResNet or MobileNet, while the detection part consists mainly of convolutional layers and outputs classification and position information. The quantized model is trained with quantization-aware training, in which low-bit values are used for forward propagation and fp32 gradients are returned during back propagation. Starting from the original one-stage detection model, pseudo-quantization nodes are added and relu nodes are replaced by satRelu. The invention uses linear quantization and can nonetheless reach the accuracy of nonlinear quantization. The low bits in the embodiments of the present invention include 4 bits and 8 bits.
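The idea of a low-bit forward pass with a full-precision backward pass can be sketched for a single scalar as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and the backward rule shown is the commonly used straight-through estimator, which this text does not name explicitly.

```python
def fake_quantize(x: float, scale: float, qmin: int, qmax: int) -> float:
    """Forward pass: snap x to the integer grid, clamp, and map back to float."""
    q = round(x / scale)          # Python rounds exact ties to the even integer
    q = max(qmin, min(qmax, q))   # same result as clamping first, for integer bounds
    return q * scale


def ste_grad(x: float, scale: float, qmin: int, qmax: int, upstream: float) -> float:
    """Backward pass (assumed straight-through estimator): the float gradient
    passes unchanged where x lies inside the clamp range and is zero outside."""
    return upstream if qmin <= x / scale <= qmax else 0.0


# 4-bit signed example with a hypothetical scale of 0.1.
print(fake_quantize(0.26, 0.1, -8, 7))   # value snapped to the 0.1 grid
print(fake_quantize(5.0, 0.1, -8, 7))    # value saturated at the top code
print(ste_grad(5.0, 0.1, -8, 7, upstream=1.5))
```

The forward value is always a multiple of the scale, so the network sees the quantization error during training, while the weights themselves continue to be updated in floating point.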

Inserting pseudo-quantization nodes in all convolutional layers or fully connected layers of a backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers.

For every convolutional layer or fully connected layer in the backbone network, two pseudo-quantization nodes are inserted to assist training; pseudo-quantization serves to reduce the influence of network quantization on accuracy. A weight pseudo-quantization layer is added to the convolutional/fully connected weights, and its specific operation is as follows:

Here w is the weight value and the clip function denotes a clamping operation; for 4-bit quantization, clip limits all inputs to the range -8 to 7. The round function rounds its input to the nearest integer. The quantization coefficient scale is defined as:

The scale is shared per output channel of the weight w: w_i is the single-channel weight obtained by splitting w along its output channels, where i = 0, 1, …, n and n is the number of output channels of w.
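The formulas for the weight pseudo-quantization layer and its scale do not appear in this text, so the sketch below is only one plausible reading. It assumes a symmetric per-channel scale equal to the largest absolute weight in the channel divided by 7; that choice, and the function names, are assumptions rather than the patent's definition. Only the clip range of -8 to 7, the rounding step, and the one-scale-per-output-channel rule come from the description.

```python
from typing import List


def channel_scale(channel: List[float], qmax: int = 7) -> float:
    """Assumed scale: largest |w| in the channel maps to the top code."""
    peak = max(abs(v) for v in channel)
    return peak / qmax if peak > 0 else 1.0  # avoid dividing by zero


def fake_quant_weights(w: List[List[float]], qmin: int = -8, qmax: int = 7) -> List[List[float]]:
    """Quantize-dequantize each output channel of w with its own scale.

    w is given as one flat list of weights per output channel.
    """
    out = []
    for channel in w:
        s = channel_scale(channel, qmax)
        out.append([max(qmin, min(qmax, round(v / s))) * s for v in channel])
    return out


weights = [[0.7, -0.3, 0.12],   # output channel 0
           [0.0, 0.0]]          # output channel 1 (all zeros)
print(fake_quant_weights(weights))
```

Sharing one scale per output channel keeps channels with small weights from being flattened by a single large weight elsewhere in the layer.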

A pseudo-quantization layer is added to the input of the convolutional/fully connected layer, and its specific operation is:

Here in is the input of the convolutional/fully connected layer. Because a relu layer follows each convolutional/fully connected layer, the input to the next layer is non-negative. To make full use of the bit space, the clip function limits the input to the range 0 to 15. sat is the value produced by the subsequent satRelu layer.
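As an illustration of the unsigned 4-bit input path, the sketch below assumes that the scale is sat divided by 15, so that the saturation value maps to the largest code. The formula itself is not present in this text, so the scale and the function name are assumptions; the 0 to 15 clip range is from the description.

```python
def fake_quant_activation(x: float, sat: float, levels: int = 15) -> float:
    """Quantize-dequantize a non-negative activation onto codes 0..levels."""
    scale = sat / levels                       # assumed: sat maps to the top code
    q = max(0, min(levels, round(x / scale)))  # clip to the unsigned range
    return q * scale


SAT = 6.0  # hypothetical saturation value; in training it would be learned
for value in (-1.0, 1.3, 10.0):
    print(value, "->", fake_quant_activation(value, SAT))
```

Using the unsigned range doubles the resolution available to post-relu activations compared with spending one bit on a sign that is never used.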

Replacing a relu layer in the backbone network with a satRelu layer.

In network training, to reduce the accuracy loss caused by low bit quantization, a satRelu layer is introduced to replace the original relu layer. The definition of the satRelu layer is:

The sat value is a trainable variable that is adjusted automatically during training to find its optimal value. The back-propagation of sat must be implemented manually in training; the gradient with respect to sat during back propagation is as follows:
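Neither the satRelu definition nor its gradient is reproduced in this text. A common form for a relu with a learnable upper bound, assumed here purely for illustration, clamps the output to the interval from 0 to sat and routes the gradient to sat wherever the input saturates. The function names are hypothetical.

```python
from typing import Tuple


def sat_relu(x: float, sat: float) -> float:
    """Assumed definition: relu clamped from above at the learnable value sat."""
    return min(max(x, 0.0), sat)


def sat_relu_grads(x: float, sat: float, upstream: float = 1.0) -> Tuple[float, float]:
    """Return (gradient w.r.t. x, gradient w.r.t. sat) for the assumed definition."""
    if x <= 0.0:
        return 0.0, 0.0        # dead region: nothing flows
    if x < sat:
        return upstream, 0.0   # linear region: gradient goes to the input
    return 0.0, upstream       # saturated region: gradient goes to sat


for value in (-1.0, 3.0, 9.0):
    print(value, sat_relu(value, 6.0), sat_relu_grads(value, 6.0))
```

Under this assumption, sat grows when many activations saturate and the loss would benefit from a wider range, which is how training can settle on a suitable clipping point.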

Since the detection network part directly outputs the detection result, quantizing it has a large influence on accuracy. Moreover, no relu layer follows the convolutional layer of the detection portion, so the method described above cannot be applied directly. This part is quantized as follows:

Adding a second weight pseudo-quantization layer to the weight portion of the convolutional layer of the detection portion of the target detection network, and adding a second input pseudo-quantization layer to the input portion of the convolutional layer of the detection portion.

Pseudo-quantization layers also need to be added to the convolutional layer of the detection portion. To reduce the loss of accuracy, however, 8-bit quantization is used for this convolution. For the weights of the convolutional layer, a weight pseudo-quantization layer is added:

Because 8-bit quantization is used, the clip range is widened to -128 to 127, and the quantization coefficient scale is defined as:

For the input of the convolutional layer, an input pseudo-quantization layer is added:

Here sat is the value produced by satFun in the next step. Because the detection network has no activation function such as relu, the input can be positive or negative, and the clipping range of the clip function is -128 to 127.
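For the detection portion, the signed 8-bit input path can be sketched in the same way. The scale formula is not present in this text; the sketch assumes sat divided by 127, and the function name is hypothetical. The clip range of -128 to 127 is from the description.

```python
def fake_quant_signed8(x: float, sat: float) -> float:
    """Quantize-dequantize a signed value onto the codes -128..127."""
    scale = sat / 127                        # assumed: +sat maps to code 127
    q = max(-128, min(127, round(x / scale)))
    return q * scale


SAT = 12.7  # hypothetical saturation value, giving a step of about 0.1
for value in (-100.0, -3.14, 100.0):
    print(value, "->", fake_quant_signed8(value, SAT))
```

With 256 codes instead of 16, the quantization step for the same range is roughly sixteen times finer, which is consistent with the stated reason for keeping the detection convolution at 8 bits.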

A satFun layer is added after the convolution layer of the detection part.

In the network training, in order to reduce the precision loss caused by low bit quantization, a satFun layer needs to be added after a convolutional layer. The definition of the satFun layer is:
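The satFun definition is not reproduced in this text. Since the detection convolution outputs can be positive or negative, one natural form, assumed here only for illustration, is a symmetric clamp to the interval from -sat to sat with a learnable sat. The function names are hypothetical.

```python
from typing import Tuple


def sat_fun(x: float, sat: float) -> float:
    """Assumed definition: symmetric clamp of x to [-sat, sat]."""
    return max(-sat, min(sat, x))


def sat_fun_grads(x: float, sat: float, upstream: float = 1.0) -> Tuple[float, float]:
    """Return (gradient w.r.t. x, gradient w.r.t. sat) for the assumed definition."""
    if x > sat:
        return 0.0, upstream    # output equals +sat
    if x < -sat:
        return 0.0, -upstream   # output equals -sat
    return upstream, 0.0        # pass-through region


for value in (-20.0, 1.5, 20.0):
    print(value, sat_fun(value, 12.7), sat_fun_grads(value, 12.7))
```

Unlike the relu-based variant, this form leaves negative values intact inside the range, which matches the observation that the detection portion has no relu after its convolution.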

In order to further reduce the precision loss in the network training process, the following skills are adopted in the network quantitative training:

1. Gradually reducing the number of bits

During training, the gradient mismatch problem is mitigated by reducing the number of bits once every several cycles. For the backbone network part, the bit width is gradually reduced to 4 bits following 16 bits -> 8 bits -> 6 bits -> 4 bits. For the detection network part, the bit width is gradually reduced to 8 bits following float -> int32 -> int8.
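A schedule of this kind can be expressed as a lookup from the training cycle to the current stage. In the sketch below, the two stage lists follow the sequences given above, while the number of cycles per stage and the helper names are hypothetical, since the text says only "every several cycles".

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

BACKBONE_BITS: List[int] = [16, 8, 6, 4]
DETECTION_STAGES: List[str] = ["float", "int32", "int8"]


def stage_for_cycle(cycle: int, cycles_per_stage: int, schedule: Sequence[T]) -> T:
    """Pick the schedule entry for a training cycle; the last entry is held."""
    index = min(cycle // cycles_per_stage, len(schedule) - 1)
    return schedule[index]


def clip_range(bits: int, signed: bool) -> Tuple[int, int]:
    """Integer clamp range for a given bit width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


CYCLES_PER_STAGE = 10  # hypothetical value
for cycle in (0, 10, 25, 99):
    bits = stage_for_cycle(cycle, CYCLES_PER_STAGE, BACKBONE_BITS)
    head = stage_for_cycle(cycle, CYCLES_PER_STAGE, DETECTION_STAGES)
    print(cycle, bits, clip_range(bits, signed=True), head)
```

The final backbone stage yields the ranges used elsewhere in the description: -8 to 7 for signed 4-bit weights and 0 to 15 for unsigned 4-bit inputs.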

2. Using the Focal Loss function

Focal Loss is used to mitigate the imbalance between positive and negative samples in target detection.
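For reference, the standard binary form of Focal Loss is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The sketch below implements that form for a single prediction; the values alpha = 0.25 and gamma = 2 are commonly used defaults and are assumptions here, as this text does not state which settings were used.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, y is the label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


print(focal_loss(0.9, 1))    # well-classified positive: small loss
print(focal_loss(0.1, 1))    # misclassified positive: large loss
print(focal_loss(0.01, 0))   # easy negative: almost no loss
```

The (1 - p_t)^gamma factor shrinks the contribution of easy examples, so the many easy background samples in one-stage detection no longer dominate the total loss.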

3. Using data augmentation

After the model is quantized, detection performance on small targets drops noticeably, so a combination of the Mosaic and MixUp data augmentation algorithms is used during training to improve small-target detection.
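The MixUp half of this augmentation can be sketched as a pixel-wise blend of two images. The sketch below is illustrative only: keeping the boxes of both images is a common convention for detection and is an assumption here, the Beta parameter is hypothetical, and Mosaic (stitching four images into one) is omitted for brevity.

```python
import random
from typing import List, Optional, Sequence, Tuple

Image = List[List[float]]                # grayscale image as rows of pixel values
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def sample_lambda(alpha: float = 1.5, rng: Optional[random.Random] = None) -> float:
    """Draw the mixing ratio from Beta(alpha, alpha); alpha is a hypothetical choice."""
    return (rng or random).betavariate(alpha, alpha)


def mixup(img_a: Image, img_b: Image,
          boxes_a: Sequence[Box], boxes_b: Sequence[Box],
          lam: float) -> Tuple[Image, List[Box]]:
    """Blend two same-sized images and keep the boxes of both (assumed convention)."""
    mixed = [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(img_a, img_b)]
    return mixed, list(boxes_a) + list(boxes_b)


image, boxes = mixup([[0.0, 1.0]], [[1.0, 0.0]],
                     [(0, 0, 1, 1)], [(1, 1, 2, 2)], lam=0.5)
print(image, boxes)
```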

According to the embodiments of the invention, pseudo-quantization layers and a satRelu layer are added to the backbone network and the detection portion of the target detection network, respectively, so that the compression rate of the model is improved through low bit quantization with only a small loss of model accuracy, the inference speed of the model is greatly improved, and the storage space occupied by the model is reduced.

It should be particularly noted that the steps in the embodiments of the method for low bit quantization for a one-stage target detection network described above may be interleaved, replaced, added, and deleted. Such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the present invention should not be limited to the embodiments.

In view of the above objects, a second aspect of the embodiments of the present invention provides a system for low bit quantization for a one-stage target detection network. As shown in fig. 2, the system 200 includes the following modules: a first pseudo-quantization module configured to insert pseudo-quantization nodes into all convolutional layers or fully connected layers of a backbone network of a target detection network, add a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and add a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers; a first precision module configured to replace a relu layer in the backbone network with a satRelu layer; a second pseudo-quantization module configured to add a second weight pseudo-quantization layer to a weight portion of a convolutional layer of a detection portion of the target detection network and to add a second input pseudo-quantization layer to an input portion of the convolutional layer of the detection portion; and a second precision module configured to add a satFun layer after the convolutional layer of the detection portion.

In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executable by the processor to perform the steps of: S1, inserting pseudo-quantization nodes in all convolutional layers or fully connected layers of a backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers; S2, replacing a relu layer in the backbone network with a satRelu layer; S3, adding a second weight pseudo-quantization layer to the weight portion of the convolutional layer of the detection portion of the target detection network, and adding a second input pseudo-quantization layer to the input portion of the convolutional layer of the detection portion; and S4, adding a satFun layer after the convolutional layer of the detection portion.

In some embodiments, the steps further comprise: training the target detection network using a combination of the Mosaic and MixUp data augmentation algorithms.

Fig. 3 is a schematic hardware structure diagram of an embodiment of the computer apparatus for low bit quantization of the one-stage target detection network according to the present invention.

Taking the device shown in fig. 3 as an example, the device includes a processor 301 and a memory 302.

The processor 301 and the memory 302 may be connected by a bus or other means, such as the bus connection in fig. 3.

The memory 302, which is a non-volatile computer-readable storage medium, may be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for low bit quantization of a one-stage target detection network in the embodiments of the present application. The processor 301 executes various functional applications of the server and data processing, i.e., implements a method of low bit quantization for a one-stage target detection network, by running non-volatile software programs, instructions, and modules stored in the memory 302.

The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of a method of low bit quantization for a one-stage target detection network, or the like. Further, the memory 302 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, which may be connected to a local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

One or more corresponding computer instructions 303 for a method of low bit quantization for a one-stage target detection network are stored in the memory 302 and, when executed by the processor 301, perform the method of low bit quantization for a one-stage target detection network in any of the method embodiments described above.

Any one of the embodiments of a computer apparatus for performing the method for low bit quantization for a one-stage object detection network described above may achieve the same or similar effects as any one of the preceding method embodiments corresponding thereto.

The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs a method of low bit quantization for a one-stage target detection network.

Fig. 4 is a schematic diagram of an embodiment of a computer storage medium for low bit quantization for a one-stage target detection network according to the present invention. Taking the computer storage medium as shown in fig. 4 as an example, the computer readable storage medium 401 stores a computer program 402 which, when executed by a processor, performs the method as described above.

Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the related hardware. A program of the method for low bit quantization for a one-stage target detection network can be stored in a computer readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method of low bit quantization for a one-stage target detection network, comprising the steps of:

inserting pseudo-quantization nodes in all convolutional layers or fully connected layers of a backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and adding a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers;

replacing a relu layer in the backbone network with a satRelu layer;

adding a second weight pseudo-quantization layer to a weight portion of a convolutional layer of a detection portion of the target detection network, and adding a second input pseudo-quantization layer to an input portion of the convolutional layer of the detection portion; and

adding a satFun layer after the convolutional layer of the detection portion.

2. The method of claim 1, wherein adding a first weight pseudo-quantization layer at a weight portion of the convolutional layer or fully-connected layer and adding a first input pseudo-quantization layer at an input portion of the convolutional layer or fully-connected layer comprises:

the input of the first weight pseudo-quantization layer is limited to between -8 and 7 and the input of the first input pseudo-quantization layer is limited to between 0 and 15.

3. The method of claim 2, wherein adding a second weight pseudo-quantization layer at a weight portion of a convolutional layer of a detection portion of the target detection network and adding a second input pseudo-quantization layer at an input portion of the convolutional layer of the detection portion comprises:

the input of the second weight pseudo-quantization layer is limited to between -128 and 127 and the input of the second input pseudo-quantization layer is limited to between -128 and 127.

4. The method of claim 1, further comprising:

training the target detection network using a combination of the Mosaic and MixUp data augmentation algorithms.

5. A system for low bit quantization for a one-stage target detection network, comprising:

a first pseudo-quantization module configured to insert pseudo-quantization nodes into all convolutional layers or fully connected layers of a backbone network of a target detection network, add a first weight pseudo-quantization layer to the weight part of the convolutional layers or fully connected layers, and add a first input pseudo-quantization layer to the input part of the convolutional layers or fully connected layers;

a first precision module configured to replace a relu layer in the backbone network with a satRelu layer;

a second pseudo-quantization module configured to add a second weight pseudo-quantization layer at a weight portion of a convolution layer of a detection portion of the target detection network and to add a second input pseudo-quantization layer at an input portion of the convolution layer of the detection portion; and

a second precision module configured to add a satFun layer after the convolution layer of the detection portion.

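6. The system of claim 5, wherein the first pseudo-quantization module is configured to:

limit the input of the first weight pseudo-quantization layer to between -8 and 7 and limit the input of the first input pseudo-quantization layer to between 0 and 15.

7. The system of claim 6, wherein the second pseudo-quantization module is configured to:

limit the input of the second weight pseudo-quantization layer to between -128 and 127 and limit the input of the second input pseudo-quantization layer to between -128 and 127.

8. The system of claim 5, further comprising a training module configured to:

train the target detection network using a combination of the Mosaic and MixUp data augmentation algorithms.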

9. A computer device, comprising:

at least one processor; and

a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.