CN113887706B - Method and device for low-bit quantization of one-stage target detection network - Google Patents

Method and device for low-bit quantization of one-stage target detection network

Info

Publication number: CN113887706B (grant of application CN113887706A)
Authority: CN (China)
Prior art keywords: layer, quantization, pseudo, input, convolution
Legal status: Active
Application number: CN202111163481.4A
Other languages: Chinese (zh)
Inventor: 王曦辉 (Wang Xihui)
Current and original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd; priority to CN202111163481.4A; published as application CN113887706A, then granted and published as CN113887706B.

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology; G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models; G06N5/04 Inference or reasoning models; G06N5/041 Abduction


Abstract

The invention provides a method, a system, a device, and a storage medium for low-bit quantization of a one-stage target detection network, wherein the method comprises the following steps: inserting pseudo-quantization nodes into all convolutional layers or fully-connected layers of the backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layers, and adding a first input pseudo-quantization layer to their input part; replacing the relu layers in the backbone network with satRelu layers; adding a second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and adding a second input pseudo-quantization layer to the input part of that convolutional layer; and adding a satFun layer after the convolutional layer of the detection part. The invention can keep the model accuracy loss below 1 percent, greatly improve the inference speed of the model, and reduce the storage space the model occupies.

Description

Method and device for low-bit quantization of one-stage target detection network
Technical Field
The present invention relates to the field of object detection, and more particularly, to a method, system, device, and storage medium for low-bit quantization of a one-stage target detection network.
Background
Neural network models generally occupy a large amount of disk space; for example, the model file of AlexNet exceeds 200 MB. Models contain millions of parameters, and the vast majority of that space is used to store them. These parameters are floating-point numbers, and it is difficult for common compression algorithms to reduce the space they occupy.
Computation inside a typical model uses floating-point arithmetic, which consumes relatively large computing resources (memory and CPU/GPU time). If other, simpler numeric types can be used without affecting model accuracy, the computation speed can be greatly improved and the computing resources consumed can be greatly reduced. This matters especially for mobile devices, which is why quantization technology was introduced.
Quantization compresses the original network by reducing the number of bits needed to represent each weight. An 8-bit quantization model achieves compression to 1/4 of the fp32 size and greatly improves the running speed of the network. Compared with 8-bit quantization, 4-bit quantization halves the model size again and improves the running speed by about 50%. However, 4 bits can represent at most 16 values, so the classification accuracy of the model drops. In addition, the absence of an int4 data type on CPUs leads to practical operational difficulties.
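The compression ratios above follow directly from the bit widths. As a quick sanity check (the 60-million-parameter count below is a made-up illustration, not a figure from the patent):

```python
def model_size_mb(num_params, bits_per_param):
    """Storage needed for the weights alone, in megabytes."""
    return num_params * bits_per_param / 8 / 1e6

fp32 = model_size_mb(60e6, 32)  # 240.0 MB
int8 = model_size_mb(60e6, 8)   #  60.0 MB, i.e. 1/4 of the fp32 size
int4 = model_size_mb(60e6, 4)   #  30.0 MB, i.e. half of the int8 size
```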
Unlike offline quantization, training-time quantization simulates the effect of quantization operations during training, which allows the model to learn and adapt to the errors caused by quantization and thereby improves the accuracy of the quantized model. Training-time quantization is therefore also known as quantization-aware training (QAT), meaning that the training process is aware that the model will be converted into a quantized model.
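QAT is typically implemented with "fake" (simulated) quantization: the forward pass snaps each value to the quantization grid while everything stays in floating point, and the backward pass treats rounding as identity (the straight-through estimator). A minimal sketch of that forward operation, not the patent's exact formulation:

```python
def fake_quantize(x, scale, qmin, qmax):
    """Quantize-dequantize: simulate the rounding/clipping error of real
    quantization while keeping the value in floating point."""
    q = max(qmin, min(qmax, round(x / scale)))  # clip(round(x / scale))
    return q * scale                            # dequantize back to float

# The forward pass sees the quantization error; in QAT the backward pass
# treats round() as identity (straight-through estimator) so fp32
# gradients can flow through this node.
y = fake_quantize(0.337, scale=0.1, qmin=-8, qmax=7)  # ~ 0.3 (grid point 3 * 0.1)
```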
Model quantization can alleviate the problems of conventional convolutional neural networks, such as large parameter counts, heavy computation, and high memory occupation, and offers the potential advantages of compressing parameters, improving speed, and reducing the memory footprint of neural networks.
At present, most training and inference frameworks support int8 quantization but not int4, because int4 quantization severely degrades model accuracy. Many low-bit quantization algorithms adopt nonlinear quantization, which improves model accuracy but introduces extra operations and reduces running speed.
To reduce precision loss, current quantization methods for one-stage target detection networks generally quantize only the backbone part of the model, not the detection network. Even though existing quantization methods achieve small accuracy degradation and good acceleration on image classification tasks, using low-bit quantization in object detection while maintaining accuracy remains a challenge.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a method, a system, a computer device, and a computer-readable storage medium for low-bit quantization of a one-stage target detection network. By adding pseudo-quantization layers and introducing satRelu layers to the backbone network and the detection part of the target detection network respectively, the compression rate of the model is improved through low-bit quantization with little loss of model accuracy, the inference speed of the model is greatly improved, and the storage space the model occupies is reduced.
Based on the above objects, an aspect of the embodiments of the present invention provides a method for low-bit quantization of a one-stage target detection network, including the steps of: inserting pseudo-quantization nodes into all convolutional layers or fully-connected layers of the backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layers, and adding a first input pseudo-quantization layer to their input part; replacing the relu layers in the backbone network with satRelu layers; adding a second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and adding a second input pseudo-quantization layer to the input part of that convolutional layer; and adding a satFun layer after the convolutional layer of the detection part.
In some embodiments, adding the first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layer, and adding the first input pseudo-quantization layer to the input part, comprises: limiting the input of the first weight pseudo-quantization layer to between -8 and 7, and limiting the input of the first input pseudo-quantization layer to between 0 and 15.
In some embodiments, adding the second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and adding the second input pseudo-quantization layer to the input part of that convolutional layer, comprises: limiting the input of the second weight pseudo-quantization layer to between -128 and 127, and limiting the input of the second input pseudo-quantization layer to between -128 and 127.
In some embodiments, the method further comprises: training the target detection network with a combination of two data-enhancement algorithms, Mosaic and MixUp.
In another aspect of an embodiment of the present invention, there is provided a system for low-bit quantization of a one-stage target detection network, including: a first pseudo-quantization module configured to insert pseudo-quantization nodes into all convolutional layers or fully-connected layers of the backbone network of a target detection network, add a first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layers, and add a first input pseudo-quantization layer to their input part; a first precision module configured to replace the relu layers in the backbone network with satRelu layers; a second pseudo-quantization module configured to add a second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and add a second input pseudo-quantization layer to the input part of that convolutional layer; and a second precision module configured to add a satFun layer after the convolutional layer of the detection part.
In some embodiments, the first pseudo-quantization module is configured to: limit the input of the first weight pseudo-quantization layer to between -8 and 7, and limit the input of the first input pseudo-quantization layer to between 0 and 15.
In some embodiments, the second pseudo-quantization module is configured to: limit the input of the second weight pseudo-quantization layer to between -128 and 127, and limit the input of the second input pseudo-quantization layer to between -128 and 127.
In some embodiments, the system further comprises a training module configured to: train the target detection network with a combination of two data-enhancement algorithms, Mosaic and MixUp.
In yet another aspect of the embodiment of the present invention, there is also provided a computer apparatus, including: at least one processor; and a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method as above.
In yet another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps as described above.
The invention has the following beneficial technical effects: by adding pseudo-quantization layers and introducing satRelu layers to the backbone network and the detection part of a target detection network respectively, the compression rate of the model is improved through low-bit quantization with little loss of model accuracy, the inference speed of the model is greatly improved, and the storage space the model occupies is reduced.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an embodiment of a method for low-bit quantization of a one-stage target detection network according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of a system for low-bit quantization of a one-stage target detection network according to the present invention;
FIG. 3 is a schematic diagram of the hardware architecture of an embodiment of a computer device for low-bit quantization of a one-stage target detection network according to the present invention;
FIG. 4 is a schematic diagram of an embodiment of a computer storage medium for low-bit quantization of a one-stage target detection network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that in the embodiments of the present invention, the expressions "first" and "second" are used to distinguish two entities or parameters that share the same name but are different; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention, and the following embodiments do not repeat this point one by one.
In a first aspect of the embodiments of the present invention, an embodiment of a method for low-bit quantization of a one-stage target detection network is presented. Fig. 1 is a schematic diagram of an embodiment of a method for low-bit quantization of a one-stage target detection network according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
S1, inserting pseudo-quantization nodes into all convolutional layers or fully-connected layers of the backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layers, and adding a first input pseudo-quantization layer to their input part;
s2, replacing a relu layer in the backbone network by using the satRelu layer;
s3, adding a second weight pseudo-quantization layer to the weight part of the convolution layer of the detection part of the target detection network, and adding a second input pseudo-quantization layer to the input part of the convolution layer of the detection part; and
S4, adding a satFun layer after the convolutional layer of the detection part.
A one-stage target detection network (e.g. SSD, YOLO, RetinaNet) mainly comprises a backbone network and a detection network. The backbone is some common classification network, such as ResNet or MobileNet, while the detection part consists mainly of convolutional layers and outputs classification and position information. A quantized model is trained using quantization-aware training: forward propagation uses low-bit values, while back-propagation uses fp32 gradients. On the basis of the original one-stage detection model, pseudo-quantization nodes are added and relu nodes are replaced with satRelu. The invention adopts linear quantization yet can also achieve the accuracy of nonlinear quantization. The low bits in the embodiment of the present invention include 4 bits and 8 bits.
Pseudo-quantization nodes are inserted into all convolutional layers or fully-connected layers of the backbone network of the target detection network: a first weight pseudo-quantization layer is added to the weight part of the convolutional or fully-connected layers, and a first input pseudo-quantization layer is added to their input part.
In some embodiments, adding the first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layer, and adding the first input pseudo-quantization layer to the input part, comprises: limiting the input of the first weight pseudo-quantization layer to between -8 and 7, and limiting the input of the first input pseudo-quantization layer to between 0 and 15.
Two pseudo-quantization nodes are inserted into every convolutional or fully-connected layer in the backbone network to assist training; pseudo quantization serves to reduce the influence of network quantization on accuracy. The weight pseudo-quantization layer added for the convolution/fully-connected weights performs
w_q = scale × clip(round(w / scale), -8, 7),
where w is the weight value, the clip function performs the clipping operation (for 4-bit quantization, clip limits all inputs to between -8 and 7), and the round function rounds its input. The quantization factor scale is defined per output channel:
one scale is shared by all weights in each output channel of w; w_i denotes the weights of the i-th output channel, where i = 0, 1, …, n and n is the number of output channels of w.
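The per-channel weight pseudo-quantization just described can be sketched as below. Since the patent's equation images are not reproduced in this text, the scale formula here (maximum absolute weight of the channel divided by 7) is an assumed common symmetric convention, not necessarily the patent's exact definition:

```python
def channel_scale(channel_weights):
    """Assumed convention: symmetric max-abs scale, so the largest weight
    of the channel maps to quantization level 7 (range [-8, 7])."""
    m = max(abs(w) for w in channel_weights)
    return m / 7 if m > 0 else 1.0  # guard against all-zero channels

def quantize_weights_4bit(weights):
    """weights: one list of fp32 weights per output channel.
    Returns the fake-quantized (quantize-dequantize) weights."""
    out = []
    for channel in weights:
        s = channel_scale(channel)  # one scale per output channel
        out.append([max(-8, min(7, round(w / s))) * s for w in channel])
    return out
```

For example, a channel whose weights are [7.0, -3.0, 0.0] gets scale 1.0 and is reproduced exactly, while a second channel is scaled independently.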
A pseudo-quantization layer is likewise added to the convolution/fully-connected input; it performs
in_q = scale × clip(round(in / scale), 0, 15),
where in is the input of the convolution/fully-connected layer. Since a relu layer follows every convolution/fully-connected layer, the input values of the next layer are all positive; to make full use of the bit space, the clip function limits the input to between 0 and 15. The quantization scale is derived from sat, the value produced by the subsequent satRelu layer.
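A sketch of this input pseudo-quantization, under the assumption that the scale is sat / 15 so the learned saturation value maps onto the top of the [0, 15] range (the patent's own scale formula is not reproduced in this text):

```python
def fake_quantize_input_4bit(x, sat):
    """Unsigned 4-bit fake quantization of a post-relu activation.
    Assumption: scale = sat / 15, so the learned saturation value sat
    corresponds to the top quantization level 15."""
    scale = sat / 15
    q = max(0, min(15, round(x / scale)))  # inputs are non-negative after relu
    return q * scale
```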
The relu layer in the backbone network is replaced with the satRelu layer.
In network training, to reduce the accuracy loss caused by low-bit quantization, a satRelu layer is introduced to replace the original relu layer. The satRelu layer is defined as
satRelu(x) = min(max(x, 0), sat).
The sat value is a variable and is adjusted automatically during training to find its optimal value. The backward propagation of sat must be implemented manually in training; the gradient with respect to sat in back-propagation is 1 where the input is saturated (x ≥ sat) and 0 elsewhere.
since the detection network part is responsible for directly outputting the detection result, quantization of the detection result has a great influence on the precision. And the convolution layer of the detection section is not followed by the relu layer, the same method as described above cannot be directly employed. The quantization mode of the part is as follows:
and adding a second weight pseudo quantization layer to the weight part of the convolution layer of the detection part of the target detection network, and adding a second input pseudo quantization layer to the input part of the convolution layer of the detection part.
In some embodiments, adding a second weight pseudo-quantization layer to the weight portion of the convolutional layer of the detection portion of the target detection network and adding a second input pseudo-quantization layer to the input portion of the convolutional layer of the detection portion comprises: the input of the second weight pseudo quantization layer is limited between-128 and 127, and the input of the second input pseudo quantization layer is limited between-128 and 127.
A pseudo-quantization layer must also be added for the convolutional layers of the detection part, but to reduce the loss of precision, 8-bit quantization is used for these convolutions. The weight pseudo-quantization layer added for the convolution weights performs
w_q = scale × clip(round(w / scale), -128, 127);
since 8-bit quantization is adopted, the clip range widens to between -128 and 127, and the quantization coefficient scale is again defined per output channel of the weight.
for the input of the convolution layer, an input pseudo quantization layer needs to be added:
where sat is the value produced by satFun in the next step. Since there is no activating function such as relu in the detection network, the input can be positive or negative, and the clip function has a clipping range of-128-127.
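A sketch of the 8-bit input pseudo-quantization for the detection part, assuming the scale is sat / 127 by analogy with the backbone case (the patent's own formula is not reproduced in this text):

```python
def fake_quantize_input_8bit(x, sat):
    """Signed 8-bit fake quantization for detection-part inputs, which
    may be negative. Assumption: scale = sat / 127."""
    scale = sat / 127
    q = max(-128, min(127, round(x / scale)))  # clip to the int8 range
    return q * scale
```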
A satFun layer is added after the convolutional layer of the detection section.
In network training, to reduce the accuracy loss caused by low-bit quantization, a satFun layer is added after the convolutional layer. The satFun layer is defined as
satFun(x) = min(max(x, -sat), sat).
The sat value is a variable and is adjusted automatically during training to find its optimal value. The backward propagation of sat must be implemented manually in training; the gradient with respect to sat in back-propagation is 1 where x ≥ sat, -1 where x ≤ -sat, and 0 elsewhere.
in some embodiments, the method further comprises: and training the target detection network by adopting a mode of combining two data enhancement algorithms, namely Mosaic and Mixuup.
To further reduce the loss of accuracy during network training, the following techniques will be employed in network quantization training:
1. gradually decreasing the number of bits
During training, the bit width is reduced every few epochs to alleviate the gradient-mismatch problem. For the backbone part, the bit number is gradually reduced to 4 bits following 16 bit -> 8 bit -> 6 bit -> 4 bit. For the detection part, it is gradually reduced to 8 bits following float -> int32 -> int8.
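The progressive bit-width schedule can be sketched as a simple lookup; the epoch boundaries below are invented for illustration, since the patent only says the bit number is reduced every few periods:

```python
def bits_for_epoch(epoch, schedule=((0, 16), (10, 8), (20, 6), (30, 4))):
    """Return the backbone bit width for a given training epoch.
    schedule: (start_epoch, bits) pairs in increasing epoch order;
    the boundaries here are hypothetical illustration values."""
    bits = schedule[0][1]
    for start, b in schedule:
        if epoch >= start:
            bits = b  # the last boundary not exceeding `epoch` wins
    return bits
```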
2. Using a focal loss function
Focal loss is used to alleviate the imbalance between positive and negative samples in target detection.
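Focal loss down-weights well-classified examples so the many easy negatives of a one-stage detector do not dominate the loss. A minimal binary form with the commonly used defaults from the focal-loss literature (alpha = 0.25, gamma = 2; these values are not taken from the patent):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p in (0, 1) and
    label y in {0, 1}. The (1 - p_t)**gamma factor shrinks the loss of
    well-classified examples; alpha balances positives vs. negatives."""
    p_t = p if y == 1 else 1.0 - p       # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 and alpha = 1 it reduces to plain cross-entropy, and an easy positive (p = 0.9) contributes far less loss than a hard one (p = 0.1).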
3. Using data enhancement
After the model is quantized, its detection of small targets degrades noticeably; therefore, during training, a combination of the two data-enhancement algorithms Mosaic and MixUp is adopted to improve small-target detection.
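MixUp, one of the two data-enhancement algorithms mentioned, blends two training images (and, with the same weight, their labels); a toy sketch on flattened pixel lists, with the Beta-distribution parameters chosen arbitrarily for illustration:

```python
import random

def mixup(img_a, img_b, lam=None):
    """Blend two equal-size images (flattened pixel lists here); the
    corresponding labels/boxes would be blended with the same weight."""
    if lam is None:
        lam = random.betavariate(1.5, 1.5)  # mixing weight in (0, 1)
    return [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]
```

With lam = 0.5 the two images contribute equally; with lam = 1.0 the first image is returned unchanged.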
According to the embodiments of the invention, by adding pseudo-quantization layers to the backbone network and the detection part of the target detection network respectively and introducing the satRelu layer, the compression rate of the model is improved through low-bit quantization with little loss of model accuracy, the inference speed of the model is greatly improved, and the storage space the model occupies is reduced.
It should be noted that the steps in the embodiments of the method for low-bit quantization of a one-stage target detection network may be interchanged, replaced, added, or removed; such reasonable permutations and combinations of the method shall therefore also fall within the protection scope of the present invention, which shall not be limited to the described embodiments.
Based on the above object, a second aspect of the embodiments of the present invention proposes a system for low-bit quantization of a one-stage target detection network. As shown in fig. 2, the system 200 includes the following modules: a first pseudo-quantization module configured to insert pseudo-quantization nodes into all convolutional layers or fully-connected layers of the backbone network of a target detection network, add a first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layers, and add a first input pseudo-quantization layer to their input part; a first precision module configured to replace the relu layers in the backbone network with satRelu layers; a second pseudo-quantization module configured to add a second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and add a second input pseudo-quantization layer to the input part of that convolutional layer; and a second precision module configured to add a satFun layer after the convolutional layer of the detection part.
In some embodiments, the first pseudo-quantization module is configured to: limit the input of the first weight pseudo-quantization layer to between -8 and 7, and limit the input of the first input pseudo-quantization layer to between 0 and 15.
In some embodiments, the second pseudo-quantization module is configured to: limit the input of the second weight pseudo-quantization layer to between -128 and 127, and limit the input of the second input pseudo-quantization layer to between -128 and 127.
In some embodiments, the system further comprises a training module configured to: train the target detection network with a combination of two data-enhancement algorithms, Mosaic and MixUp.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to perform the steps of: S1, inserting pseudo-quantization nodes into all convolutional layers or fully-connected layers of the backbone network of a target detection network, adding a first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layers, and adding a first input pseudo-quantization layer to their input part; S2, replacing the relu layers in the backbone network with satRelu layers; S3, adding a second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and adding a second input pseudo-quantization layer to the input part of that convolutional layer; and S4, adding a satFun layer after the convolutional layer of the detection part.
In some embodiments, adding the first weight pseudo-quantization layer to the weight part of the convolutional or fully-connected layer, and adding the first input pseudo-quantization layer to the input part, comprises: limiting the input of the first weight pseudo-quantization layer to between -8 and 7, and limiting the input of the first input pseudo-quantization layer to between 0 and 15.
In some embodiments, adding the second weight pseudo-quantization layer to the weight part of the convolutional layer of the detection part of the target detection network, and adding the second input pseudo-quantization layer to the input part of that convolutional layer, comprises: limiting the input of the second weight pseudo-quantization layer to between -128 and 127, and limiting the input of the second input pseudo-quantization layer to between -128 and 127.
In some embodiments, the steps further comprise: training the target detection network with a combination of two data-enhancement algorithms, Mosaic and MixUp.
As shown in fig. 3, a hardware structure diagram of an embodiment of the computer device for low-bit quantization of a one-stage object detection network according to the present invention is provided.
Taking the example of the device shown in fig. 3, a processor 301 and a memory 302 are included in the device.
The processor 301 and the memory 302 may be connected by a bus or otherwise, for example in fig. 3.
The memory 302, as a non-volatile computer-readable storage medium, is used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for low-bit quantization of a one-stage target detection network in the embodiments of the present application. The processor 301 executes the various functional applications and data processing of the server, i.e. implements the method for low-bit quantization of a one-stage target detection network, by running the non-volatile software programs, instructions, and modules stored in the memory 302.
The memory 302 may include a program storage area and a data storage area, where the program storage area may store an operating system and at least one application program required for functionality, and the data storage area may store data created by the use of the method for low-bit quantization of a one-stage target detection network, and the like. In addition, the memory 302 may include high-speed random-access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 may optionally include memory located remotely from the processor 301, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more computer instructions 303 corresponding to the method for low-bit quantization of a one-stage target detection network are stored in the memory 302 and, when executed by the processor 301, perform the method for low-bit quantization of a one-stage target detection network in any of the method embodiments described above.
Any one of the embodiments of the computer device that performs the method for low bit quantization of a one-stage object detection network described above may achieve the same or similar effects as any of the method embodiments described above that correspond thereto.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the method for low-bit quantization of a one-stage target detection network.
As shown in fig. 4, a schematic diagram of an embodiment of the computer storage medium for low-bit quantization of a one-stage object detection network according to the present invention is provided. Taking a computer storage medium as shown in fig. 4 as an example, the computer readable storage medium 401 stores a computer program 402 that when executed by a processor performs the above method.
Finally, it should be noted that, as will be appreciated by those skilled in the art, all or part of the procedures in the methods of the embodiments described above may be implemented by a computer program instructing related hardware. The program of the method for low-bit quantization of a one-stage object detection network may be stored in a computer readable storage medium, and the program, when executed, may include the procedures of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the method embodiments described above.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (6)

1. A method for low bit quantization for a one-stage image detection network, comprising the steps of:
inserting pseudo-quantization nodes into the convolution layers or fully connected layers of a backbone network of an image detection network, adding a first weight pseudo-quantization layer to the weight part of the convolution layer or fully connected layer, and adding a first input pseudo-quantization layer to the input part of the convolution layer or fully connected layer;
replacing a relu layer in the backbone network with a satRelu layer;
adding a second weight pseudo-quantization layer to a weight part of a convolution layer of a detection part of the image detection network, and adding a second input pseudo-quantization layer to an input part of the convolution layer of the detection part; and
a satFun layer is added after the convolutional layer of the detection section,
the adding a first weight pseudo-quantization layer to the weight part of the convolution layer or fully connected layer, and adding a first input pseudo-quantization layer to the input part of the convolution layer or fully connected layer comprises:
limiting the input of the first weight pseudo-quantization layer to between -8 and 7, and limiting the input of the first input pseudo-quantization layer to between 0 and 15,
the adding a second weight pseudo-quantization layer to the weight part of the convolution layer of the detection part of the image detection network, and adding a second input pseudo-quantization layer to the input part of the convolution layer of the detection part comprises:
limiting the input of the second weight pseudo-quantization layer to between -128 and 127, and limiting the input of the second input pseudo-quantization layer to between -128 and 127,
the definition of the satRelu layer is: satRelu(x) = min(max(x, 0), sat),
wherein the value sat is a variable,
the definition of the satFun layer is: satFun(x) = min(max(x, -sat), sat),
where the value sat is a variable.
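Taken together, claim 1 describes standard quantization-aware training with clamped integer ranges: a pseudo-quantization node rounds a tensor onto an integer grid and immediately dequantizes it, while the saturation layers clamp activations at a threshold sat. A minimal numpy sketch of these operations (hypothetical helper names and an illustrative scale of 0.1; not the patented implementation) is:

```python
import numpy as np

def fake_quantize(x, scale, qmin, qmax):
    """Pseudo-quantization node: round to the integer grid in [qmin, qmax], then dequantize."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

def sat_relu(x, sat):
    """Assumed form of the satRelu layer: clamp activations to [0, sat]."""
    return np.minimum(np.maximum(x, 0.0), sat)

def sat_fun(x, sat):
    """Assumed form of the satFun layer: clamp outputs to [-sat, sat]."""
    return np.minimum(np.maximum(x, -sat), sat)

# Backbone ranges from claim 1: weights in [-8, 7] (signed 4-bit),
# activations in [0, 15] (unsigned 4-bit); the detection part uses [-128, 127] (8-bit).
w = np.array([-1.0, 0.3, 0.9])
w_q = fake_quantize(w, scale=0.1, qmin=-8, qmax=7)  # -10 clips to -8, 3 stays, 9 clips to 7
```

With the illustrative scale of 0.1, `w_q` is approximately [-0.8, 0.3, 0.7]: the two out-of-range integers are saturated to the 4-bit limits, which is exactly the effect of the claimed input limits.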
2. The method according to claim 1, wherein the method further comprises:
training the image detection network using a combination of two data augmentation algorithms, Mosaic and MixUp.
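Of the two augmentations named in claim 2, MixUp is the simpler to illustrate: it forms a convex combination of two training images with a coefficient lam (Mosaic, by contrast, stitches four images into one). A hedged numpy sketch of the generic MixUp blend (not the patent's training code) is:

```python
import numpy as np

def mixup(img_a, img_b, lam):
    """MixUp blend: convex combination of two images; labels are weighted by lam and 1 - lam."""
    return lam * img_a + (1.0 - lam) * img_b

a = np.full((2, 2, 3), 200.0)  # toy "image" A
b = np.full((2, 2, 3), 100.0)  # toy "image" B
mixed = mixup(a, b, lam=0.7)   # every pixel becomes approximately 0.7*200 + 0.3*100 = 170
```

In practice lam is commonly drawn from a Beta distribution per batch, so the network sees soft blends of object instances, which is what makes the augmentation useful for detection training.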
3. A system for low bit quantization for a one-stage image detection network, comprising:
the first pseudo-quantization module is configured to insert pseudo-quantization nodes into the convolution layers or fully connected layers of a backbone network of the image detection network, add a first weight pseudo-quantization layer to the weight part of the convolution layer or fully connected layer, and add a first input pseudo-quantization layer to the input part of the convolution layer or fully connected layer;
a first precision module configured to replace a relu layer in the backbone network with a satRelu layer;
a second pseudo-quantization module configured to add a second weight pseudo-quantization layer to a weight portion of a convolution layer of a detection portion of the image detection network, and add a second input pseudo-quantization layer to an input portion of the convolution layer of the detection portion; and
a second precision module configured to add a satFun layer after the convolutional layer of the detection portion,
the first pseudo-quantization module is configured to:
limit the input of the first weight pseudo-quantization layer to between -8 and 7, and limit the input of the first input pseudo-quantization layer to between 0 and 15,
the second pseudo-quantization module is configured to:
limit the input of the second weight pseudo-quantization layer to between -128 and 127, and limit the input of the second input pseudo-quantization layer to between -128 and 127,
the definition of the satRelu layer is: satRelu(x) = min(max(x, 0), sat),
wherein the value sat is a variable,
the definition of the satFun layer is: satFun(x) = min(max(x, -sat), sat),
where the value sat is a variable.
4. A system according to claim 3, wherein the system further comprises a training module configured to:
train the image detection network using a combination of two data augmentation algorithms, Mosaic and MixUp.
5. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, performing the steps of the method of any one of claims 1-2.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1-2.
CN202111163481.4A 2021-09-30 2021-09-30 Method and device for low-bit quantization of one-stage target detection network Active CN113887706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111163481.4A CN113887706B (en) 2021-09-30 2021-09-30 Method and device for low-bit quantization of one-stage target detection network


Publications (2)

Publication Number Publication Date
CN113887706A CN113887706A (en) 2022-01-04
CN113887706B true CN113887706B (en) 2024-02-06

Family

ID=79005125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111163481.4A Active CN113887706B (en) 2021-09-30 2021-09-30 Method and device for low-bit quantization of one-stage target detection network

Country Status (1)

Country Link
CN (1) CN113887706B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882058A (en) * 2020-06-24 2020-11-03 苏州浪潮智能科技有限公司 4-bit quantization method and system of neural network
CN112200311A (en) * 2020-09-17 2021-01-08 苏州浪潮智能科技有限公司 4-bit quantitative reasoning method, device, equipment and readable medium
JP6856112B1 (en) * 2019-12-25 2021-04-07 沖電気工業株式会社 Neural network weight reduction device, neural network weight reduction method and program
CN112699859A (en) * 2021-03-24 2021-04-23 华南理工大学 Target detection method, device, storage medium and terminal
CN113222108A (en) * 2021-03-09 2021-08-06 北京大学 Target detection processing method, device and equipment


Also Published As

Publication number Publication date
CN113887706A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
JP3017380B2 (en) Data compression method and apparatus, and data decompression method and apparatus
CN113011581B (en) Neural network model compression method and device, electronic equipment and readable storage medium
CN109671026B (en) Gray level image noise reduction method based on void convolution and automatic coding and decoding neural network
CN108596841B (en) Method for realizing image super-resolution and deblurring in parallel
CN110688088B (en) General nonlinear activation function computing device and method for neural network
CN111416743A (en) Convolutional network accelerator, configuration method and computer readable storage medium
CN111144457B (en) Image processing method, device, equipment and storage medium
CN111240746B (en) Floating point data inverse quantization and quantization method and equipment
CN110929865A (en) Network quantification method, service processing method and related product
CN111738427B (en) Operation circuit of neural network
WO2022262660A1 (en) Pruning and quantization compression method and system for super-resolution network, and medium
CN111507465A (en) Configurable convolutional neural network processor circuit
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN114418062A (en) Method, system, device and storage medium for deep convolutional neural network quantization
CN110110852B (en) Method for transplanting deep learning network to FPAG platform
CN113887706B (en) Method and device for low-bit quantization of one-stage target detection network
CN112966807A (en) Convolutional neural network implementation method based on storage resource limited FPGA
CN116992946A (en) Model compression method, apparatus, storage medium, and program product
CN110135563B (en) Convolution neural network binarization method and operation circuit
CN113159297B (en) Neural network compression method, device, computer equipment and storage medium
CN115984742A (en) Training method of video frame selection model, video processing method and device
CN114841325A (en) Data processing method and medium of neural network model and electronic device
CN113673693A (en) Method for deep neural network compression
CN113327254A (en) Image segmentation method and system based on U-type network
CN111047013A (en) Convolutional neural network structure optimization method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant