CN113902044A - Image target extraction method based on lightweight YOLOV3 - Google Patents

Image target extraction method based on lightweight YOLOV3

Info

Publication number
CN113902044A
CN113902044A (application CN202111496943.4A)
Authority
CN
China
Prior art keywords
target
network
center point
yolov3
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111496943.4A
Other languages
Chinese (zh)
Other versions
CN113902044B (en)
Inventor
徐嘉辉
王彬
徐凯
陈石
赵佳佳
王中杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Daoyuan Technology Group Co ltd
Original Assignee
Jiangsu Peregrine Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Peregrine Microelectronics Co., Ltd.
Priority to CN202111496943.4A
Publication of CN113902044A
Application granted
Publication of CN113902044B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image target extraction method based on lightweight YOLOV3. The method improves the backbone network structure of the existing YOLOV3: a depthwise separable convolution is adopted as the basic convolution block, a point convolution for raising the dimension is introduced before the depthwise separable convolution to strengthen the feature extraction capability, and residual connections are introduced while keeping the same downsampling multiples, which greatly reduces the parameters of the network and makes the trained model easier to deploy on low-compute embedded devices. In addition, the method detects targets by predicting the target center, which reduces the parameters and complexity required by the network head compared with the existing YOLOV3; and because a large number of prior boxes are no longer needed, the network does not need a non-maximum suppression algorithm during inference, which greatly increases the inference speed.

Description

Image target extraction method based on lightweight YOLOV3
Technical Field
The invention relates to the field of target detection, and in particular to an image target extraction method based on an improved YOLOV3.
Background
The task of target detection is to find all objects of interest in an image and to determine their category and location; it is one of the core problems in the field of computer vision. Because objects vary widely in appearance, shape and pose, and imaging is further disturbed by factors such as illumination and occlusion, target detection has long been one of the most challenging problems in computer vision.
Target detection algorithms based on deep learning fall mainly into two categories: two-stage and one-stage. A two-stage network first generates candidate regions, called region proposals (RP), i.e. pre-selected boxes that may contain the objects to be detected, and then classifies these samples with a convolutional neural network. Common two-stage target detection algorithms include R-CNN, SPP-Net, Fast R-CNN, R-FCN, and the like. One-stage networks pursue speed and abandon the two-stage architecture: no separate network is set up to generate region proposals; instead, dense sampling is performed directly on the feature map to generate a large number of prior boxes. Common one-stage target detection algorithms include YOLO, SSD, RetinaNet, and the like.
Among them, the YOLO series is the most classical family of one-stage algorithms. The YOLO algorithm extracts three feature maps at different scales to detect large, medium and small targets respectively, generates a large number of prior boxes on these three feature maps, and then filters the prior boxes with a non-maximum suppression algorithm. The speed of the YOLO series has improved greatly compared with other networks, but applying YOLO to low-cost devices such as embedded devices still faces the following problems:
1. The backbone network Darknet of YOLOV3 borrows the idea of ResNet, which improves the feature extraction capability of the network but greatly increases the depth and the number of parameters of the network. The model trained with this network is therefore large, cannot be deployed on low-compute embedded devices, and increases the cost of practical deployment.
2. The prior-box (anchor) mechanism of YOLOV3 increases the complexity of the network head and the number of network parameters; at the same time, the network has to filter the prior boxes with a non-maximum suppression algorithm, so the model takes more time during inference.
Disclosure of Invention
The purpose of the invention is as follows: in view of the problems of the prior art, an image target extraction method based on lightweight YOLOV3 is provided, so that the parameters of the conventional YOLOV3 network are greatly reduced and the trained model is easier to deploy on low-compute embedded devices.
The technical scheme is as follows: an image target extraction method based on lightweight YOLOV3 comprises the following steps:
step 1: constructing a lightweight YOLOV3 network;
the backbone network of the lightweight YOLOV3 network comprises a CBL module and a plurality of Res modules connected in sequence, wherein the CBL module comprises a 1 × 1 point convolution, a depthwise separable convolution, a BN layer and a LeakyReLU, and the Res module comprises two connected CBL modules; after downsampling and feature fusion in the backbone network, the input picture yields feature maps at three scales, the downsampling multiples being 8, 16 and 32 respectively; with these downsampling multiples fixed, the feature extraction capability of the network and the number of network parameters are balanced by adjusting the number of Res modules;
the Head network of the lightweight YOLOV3 network is composed of three conv layers whose sizes are 1 × 1 × cls, 1 × 1 × 2 and 1 × 1 × 2 respectively, where cls represents the number of classes of the dataset; the three conv layers respectively output: the predicted center point coordinates for each class of target in the dataset, the predicted offset of the target center point, and the predicted target size, where the target size refers to the width and height of the target box containing the target;
step 2: training the lightweight YOLOV3 network;
Firstly, the training set pictures are annotated with the target size (W, H), the target center point coordinates (x, y), and the target category c, where the target size (W, H) is composed of the width W and the height H of the target box containing the target. From the annotation information, the size of the feature map output by the network and the coordinates of the target center point in the feature map (x̃, ỹ) are computed, where x̃ = ⌊x/R⌋ and ỹ = ⌊y/R⌋, ⌊·⌋ denotes rounding down, and R represents the downsampling multiple.
Then, the pixels within a circle of radius r around the target center point are Gaussian-smoothed to obtain:

Y_xyc = exp( -((x - x̃)² + (y - ỹ)²) / (2σ²) )

where Y_xyc represents the confidence of class c at pixel coordinates (x, y), the value of Y_xyc lies between 0 and 1, and σ is a standard deviation obtained adaptively from the target size; the confidence values outside the pixel circle are all set to 0.

Finally, the network is trained using the data after Gaussian smoothing.
And step 3: the test picture is input into the trained lightweight YOLOV3 network for target feature extraction. For each class of target, the network outputs the predicted center point coordinates (x̂, ŷ), the predicted offset of the target center point (δx, δy), and the predicted target size, and the coordinates of the upper-left and lower-right corners of the target box are decoded according to the following formulas:

x1 = x̂ + δx - ŵ/2,  y1 = ŷ + δy - ĥ/2
x2 = x̂ + δx + ŵ/2,  y2 = ŷ + δy + ĥ/2

where ŵ and ĥ respectively represent the predicted width and height of the target.
Further, in step 2, if two adjacent targets exist in the same picture, the Gaussian smoothing is performed separately with each target as the center, and the confidence of each pixel in the overlapping portion of the two pixel circles takes the larger of the two values.
Further, the radius r is determined by requiring that a box with the width w and height h of the annotated target box, when shifted by r, still has an intersection-over-union of at least overlap with the target box; r is obtained by solving the resulting quadratic equation in r. Here w and h are the width and height of the target box containing the annotated target, and overlap is the set threshold representing the intersection-over-union between the shifted box and the target box.
Further, in step 2, the loss function adopted in training the network is as follows:

L = L_k + λ1 · L_off + λ2 · L_size

where λ1 and λ2 are coefficients that adjust the loss function and L is the loss function value.

L_k is the target center point loss:

L_k = -(1/N) · Σ_xyc { (1 - Ŷ_xyc)^α · log(Ŷ_xyc),                  if Y_xyc = 1
                       (1 - Y_xyc)^β · (Ŷ_xyc)^α · log(1 - Ŷ_xyc),  otherwise }

where N indicates the number of targets in the picture, the sum runs over all coordinate points (x, y) of the channel in which class c is located, Ŷ_xyc represents the confidence of class c predicted at coordinates (x, y), and α and β represent hyperparameters.

L_off is the center point offset loss:

L_off = (1/N) · Σ_p | Ô_p̃ - (p/R - p̃) |

where p denotes the target center point coordinates (x, y), Ô_p̃ denotes the predicted target center point offset, and p̃ denotes the target center point coordinates (x̃, ỹ) in the feature map.

L_size is the target size loss:

L_size = (1/N) · Σ_{k=1..N} | Ŝ_k - s_k |

where Ŝ_k is the predicted target size and s_k = (W_k, H_k) is the annotated size of the k-th target.
Beneficial effects: 1. The invention improves the structure of the existing YOLOV3 backbone network: a depthwise separable convolution is adopted as the basic convolution block, a point convolution for raising the dimension is introduced before the depthwise separable convolution to enhance the feature extraction capability, and residual connections are introduced while keeping the same downsampling multiples, which greatly reduces the parameters of the network and makes the trained model easier to deploy on low-compute embedded devices.
2. The method detects targets by predicting the target center, which reduces the parameters and complexity required by the network head compared with the existing YOLOV3; at the same time, because a large number of prior boxes are no longer needed, the network does not need a non-maximum suppression algorithm during inference, which greatly increases the inference speed.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the backbone network architecture of the lightweight YOLOV3 of the present invention;
FIG. 3 is a block diagram of the CBL module in the lightweight YOLOV3 of the present invention;
FIG. 4 is a block diagram of the Res module in the lightweight YOLOV3 of the present invention;
FIG. 5 is a block diagram of the Head network in the lightweight YOLOV3 of the present invention;
FIG. 6 is a complete block diagram of the lightweight YOLOV3 of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in fig. 1, an image target extraction method based on lightweight YOLOV3 includes:
step 1: a lightweight YOLOV3 network was constructed.
The backbone network of the lightweight YOLOV3 network includes a CBL module and several Res modules connected in sequence, as shown in fig. 2. The CBL module is composed of a 1 × 1 point convolution, a depthwise separable convolution, a BN layer and a LeakyReLU, as shown in fig. 3. In this embodiment, the size of the input picture is 608 × 608, and feature maps of 76 × 76, 38 × 38 and 19 × 19 are output after the downsampling and feature fusion of the backbone network, i.e. the downsampling multiples are 8, 16 and 32 respectively.
The Res module includes two connected CBL modules, as shown in fig. 4; adding Res modules introduces residual connections, which avoids simple stacking of convolutional layers and reduces the training difficulty of the network. While keeping the downsampling multiples unchanged, the feature extraction capability of the network and the number of network parameters can be balanced by adjusting the number of Res modules according to the feature complexity of the pictures: the number of Res modules can be increased to strengthen the feature extraction capability when the pictures are complex, and decreased to reduce the parameter count and computation when the pictures are simple. The backbone network is connected to the Neck network of the YOLOV3 network.
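As a minimal sketch of how the CBL and Res modules described above could be realized, the following PyTorch-style code follows the stated ordering (point convolution for raising the dimension, depthwise separable convolution, BN, LeakyReLU) plus a residual connection; the expansion ratio, kernel size, stride and the exact placement of normalization and activation are assumptions, since the patent does not fix them:

import torch
import torch.nn as nn

class CBL(nn.Module):
    """Basic convolution block: 1x1 point conv raises the channel dimension,
    then a depthwise separable conv (depthwise + pointwise), BN and LeakyReLU."""
    def __init__(self, in_ch, out_ch, stride=1, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),               # point conv (dimension raise)
            nn.Conv2d(mid, mid, kernel_size=3, stride=stride, padding=1,
                      groups=mid, bias=False),                              # depthwise conv
            nn.Conv2d(mid, out_ch, kernel_size=1, bias=False),              # pointwise projection
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Res(nn.Module):
    """Residual block: two connected CBL modules plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(CBL(ch, ch), CBL(ch, ch))

    def forward(self, x):
        return x + self.body(x)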
The Head network of the lightweight YOLOV3 network is composed of three conv layers whose sizes are 1 × 1 × cls, 1 × 1 × 2 and 1 × 1 × 2 respectively, where cls represents the number of classes of the dataset. The three conv layers of the Head network respectively output: the predicted center point coordinates for each class of target in the dataset, the predicted offset of the target center point, and the predicted target size. In this embodiment, the output sizes of the three conv layers are 19 × 19 × cls, 19 × 19 × 2 and 19 × 19 × 2 respectively; the target size refers to the width and height of the target box containing the target.
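A minimal sketch of such a head, assuming a single input feature map with in_ch channels and a sigmoid on the center heatmap (the patent only fixes the three 1 × 1 convolutions and their output channel counts):

import torch.nn as nn

class Head(nn.Module):
    """Anchor-free head: three 1x1 convs predicting the per-class center
    heatmap, the center point offset (dx, dy) and the box size (w, h)."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.heatmap = nn.Conv2d(in_ch, num_classes, kernel_size=1)   # center confidence per class
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=1)              # center point offset
        self.size = nn.Conv2d(in_ch, 2, kernel_size=1)                # target width and height

    def forward(self, x):
        return self.heatmap(x).sigmoid(), self.offset(x), self.size(x)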
Compared with the prior art, when the number of classes is 80, the computation of the Head network in the prior art is: 76 × 76 × 255 × 128 + 38 × 38 × 255 × 256 + 19 × 19 × 255 × 1024 = 377,057,280 (377 MFLOPs), while after applying the scheme of the invention it becomes: 76 × 76 × 84 × 128 + 38 × 38 × 84 × 256 + 19 × 19 × 84 × 1024 = 124,207,104 (124 MFLOPs). The number of parameters is reduced from the original 255 × 128 + 255 × 256 + 255 × 1024 = 359,040 to 84 × 128 + 84 × 256 + 84 × 1024 = 118,272. Meanwhile, compared with the prior art, a large number of prior boxes no longer need to be generated, so the time for non-maximum suppression is saved during network inference and the inference speed is increased.
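For reference, the computation and parameter counts quoted above can be reproduced with a few lines of arithmetic; the per-scale input channel widths 128, 256 and 1024 and the 1 × 1 head convolutions are taken from the text, while counting pure multiply-accumulates is an assumption:

# Reproduces the head computation and parameter counts quoted above, assuming
# 1x1 convs over the 76x76, 38x38 and 19x19 feature maps with 128, 256 and
# 1024 input channels respectively.
scales = [(76, 128), (38, 256), (19, 1024)]

def head_macs(out_ch):
    return sum(s * s * out_ch * in_ch for s, in_ch in scales)

def head_params(out_ch):
    return sum(out_ch * in_ch for _, in_ch in scales)

print(head_macs(255), head_macs(84))      # 377057280 124207104
print(head_params(255), head_params(84))  # 359040 118272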
The complete YOLOV3 network constructed by the present invention is shown in fig. 6.
Step 2: the constructed lightweight YOLOV3 network was trained.
Firstly, the training set pictures are annotated with the target size (W, H), the target center point coordinates (x, y), and the target category c. From the annotation information, the size of the feature map output by the network and the coordinates of the target center point in the feature map (x̃, ỹ) are computed, where x̃ = ⌊x/R⌋ and ỹ = ⌊y/R⌋, ⌊·⌋ denotes rounding down, and R represents the downsampling multiple; the target size (W, H) is composed of the width W and the height H of the target box containing the target.
In order to make the training process smoother, the pixels within a circle of radius r around the target center point are Gaussian-smoothed to obtain:

Y_xyc = exp( -((x - x̃)² + (y - ỹ)²) / (2σ²) )

where Y_xyc represents the confidence of class c at pixel coordinates (x, y), the value of Y_xyc lies between 0 and 1, and a larger value of Y_xyc means that a target is more likely to be located there; σ is the standard deviation obtained adaptively from the target size.
The radius r is determined by requiring that a box with the width w and height h of the annotated target box, when shifted by r, still has an intersection-over-union of at least overlap with the target box; r is obtained by solving the resulting quadratic equation in r. Here w and h are the width and height of the target box containing the annotated target, and overlap is the set threshold representing the intersection-over-union between the shifted box and the target box, set to 0.7 in this embodiment. The confidence values outside the pixel circle are all set to 0.
If two adjacent targets exist in the same picture, the above Gaussian smoothing is performed separately with each target as the center, and the confidence of each pixel in the overlapping portion of the two pixel circles takes the larger of the two values.
Finally, the network is trained using the label data after Gaussian smoothing.
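The construction of the Gaussian-smoothed center heatmap described above might be sketched as follows; choosing σ = r/3 and iterating pixel by pixel are assumptions made only for illustration, since the patent derives σ adaptively from the target size:

import numpy as np

def draw_center_gaussian(heatmap, cx, cy, r):
    """Write a Gaussian bump of radius r centred on feature-map cell (cx, cy)
    into one class channel; where two targets overlap, the larger value wins,
    and everything outside the pixel circle stays 0."""
    h, w = heatmap.shape
    r = int(r)
    sigma = max(r / 3.0, 1e-3)                       # assumption: sigma tied to the radius
    for y in range(max(0, cy - r), min(h, cy + r + 1)):
        for x in range(max(0, cx - r), min(w, cx + r + 1)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= r * r:                          # only inside the pixel circle
                value = np.exp(-d2 / (2.0 * sigma ** 2))
                heatmap[y, x] = max(heatmap[y, x], value)
    return heatmap

# usage: a 19x19 single-class heatmap with a target centre at cell (9, 7)
hm = np.zeros((19, 19), dtype=np.float32)
draw_center_gaussian(hm, cx=9, cy=7, r=2)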
In this embodiment, face data is used for network training, so the number of target classes cls is set to 2, that is, the targets fall into two categories: one is a face and the other is not a face. In training the network, the loss function used is as follows:
L = L_k + λ1 · L_off + λ2 · L_size

where λ1 and λ2 are coefficients that adjust the loss function, set to 0.1 and 1 respectively in this embodiment, and L is the loss function value.

L_k is the target center point loss:

L_k = -(1/N) · Σ_xyc { (1 - Ŷ_xyc)^α · log(Ŷ_xyc),                  if Y_xyc = 1
                       (1 - Y_xyc)^β · (Ŷ_xyc)^α · log(1 - Ŷ_xyc),  otherwise }

where N indicates the number of targets in the picture, the sum runs over all coordinate points (x, y) of the channel in which class c is located, Ŷ_xyc represents the confidence of class c predicted at coordinates (x, y), and α and β are adjustable hyperparameters, set to 2 and 4 respectively in this embodiment.

L_off is the center point offset loss:

L_off = (1/N) · Σ_p | Ô_p̃ - (p/R - p̃) |

where p denotes the target center point coordinates (x, y), Ô_p̃ denotes the predicted target center point offset, and p̃ denotes the target center point coordinates (x̃, ỹ) in the feature map. The predicted target center point coordinates correspond to the feature map; through the target center point offset predicted by the network, the predicted center point coordinates can be mapped back to the original image.

L_size is the target size loss:

L_size = (1/N) · Σ_{k=1..N} | Ŝ_k - s_k |

where Ŝ_k is the predicted target size and s_k = (W_k, H_k) is the annotated size of the k-th target.
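A compact sketch of this loss, assuming the heatmap has already passed through a sigmoid and that a binary mask marks the feature-map cells that contain a target center; the pairing of the weight 0.1 with the offset term and 1 with the size term mirrors the order used above but is an assumption:

import torch

def center_focal_loss(pred, gt, alpha=2, beta=4):
    """Penalty-reduced focal loss over the center heatmap; pred and gt are
    N x C x H x W tensors with values in (0, 1)."""
    eps = 1e-6
    pos = gt.eq(1).float()                     # cells that are exact target centers
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred + eps)
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred + eps)
    num_targets = pos.sum().clamp(min=1)
    return -(pos_loss + neg_loss).sum() / num_targets

def masked_l1_loss(pred, target, mask):
    """L1 loss for the offset / size branches, averaged over the number of targets."""
    num_targets = mask.sum().clamp(min=1)
    return (torch.abs(pred - target) * mask).sum() / num_targets

def total_loss(heat_p, heat_gt, off_p, off_gt, size_p, size_gt, mask,
               lambda_off=0.1, lambda_size=1.0):
    return (center_focal_loss(heat_p, heat_gt)
            + lambda_off * masked_l1_loss(off_p, off_gt, mask)
            + lambda_size * masked_l1_loss(size_p, size_gt, mask))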
And step 3: the test picture is input into the trained lightweight YOLOV3 network for target feature extraction. For each class of target, the network outputs the predicted center point coordinates (x̂, ŷ), the predicted offset of the target center point (δx, δy), and the predicted target size, and the coordinates of the upper-left and lower-right corners of the target box are decoded according to the following formulas:

x1 = x̂ + δx - ŵ/2,  y1 = ŷ + δy - ĥ/2
x2 = x̂ + δx + ŵ/2,  y2 = ŷ + δy + ĥ/2

where ŵ and ĥ respectively represent the predicted width and height of the target.
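A minimal sketch of this NMS-free decoding step: a 3 × 3 max-pool keeps only local peaks of the center heatmap, the top-k peaks give the centers, and boxes are recovered from the predicted offset and size; the top-k selection, the 3 × 3 window and scaling the result back by the downsampling multiple are assumptions made for illustration:

import torch
import torch.nn.functional as F

def decode_centers(heatmap, offset, size, k=100, down=32):
    """Decode boxes from one image (tensors of shape 1xCxHxW) without NMS."""
    # keep only local maxima of the heatmap
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float() * heatmap
    n, c, h, w = peaks.shape
    k = min(k, c * h * w)
    scores, idx = peaks.view(n, -1).topk(k)
    cls = torch.div(idx, h * w, rounding_mode="floor")
    ys = torch.div(idx % (h * w), w, rounding_mode="floor")
    xs = idx % w
    boxes = []
    for i in range(k):
        x, y = xs[0, i].item(), ys[0, i].item()
        dx, dy = offset[0, :, y, x].tolist()             # predicted center offset
        bw, bh = size[0, :, y, x].tolist()               # predicted width and height
        cx, cy = x + dx, y + dy
        boxes.append([down * (cx - bw / 2), down * (cy - bh / 2),
                      down * (cx + bw / 2), down * (cy + bh / 2),
                      scores[0, i].item(), cls[0, i].item()])
    return boxes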
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and refinements can be made without departing from the principle of the present invention, and such modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. An image target extraction method based on lightweight YOLOV3 is characterized by comprising the following steps:
step 1: constructing a lightweight YOLOV3 network;
the backbone network of the lightweight YOLOV3 network comprises a CBL module and a plurality of Res modules connected in sequence, wherein the CBL module comprises a 1 × 1 point convolution, a depthwise separable convolution, a BN layer and a LeakyReLU, and the Res module comprises two connected CBL modules; after downsampling and feature fusion in the backbone network, the input picture yields feature maps at three scales, the downsampling multiples being 8, 16 and 32 respectively; with these downsampling multiples fixed, the feature extraction capability of the network and the number of network parameters are balanced by adjusting the number of Res modules;
the Head network of the lightweight YOLOV3 network is composed of three conv layers whose sizes are 1 × 1 × cls, 1 × 1 × 2 and 1 × 1 × 2 respectively, where cls represents the number of classes of the dataset; the three conv layers respectively output: the predicted center point coordinates for each class of target in the dataset, the predicted offset of the target center point, and the predicted target size, where the target size refers to the width and height of the target box containing the target;
step 2: training the lightweight YOLOV3 network;
Firstly, the training set pictures are annotated with the target size (W, H), the target center point coordinates (x, y), and the target category c, where the target size (W, H) is composed of the width W and the height H of the target box containing the target. From the annotation information, the size of the feature map output by the network and the coordinates of the target center point in the feature map (x̃, ỹ) are computed, where x̃ = ⌊x/R⌋ and ỹ = ⌊y/R⌋, ⌊·⌋ denotes rounding down, and R represents the downsampling multiple.
Then, the pixels within a circle of radius r around the target center point are Gaussian-smoothed to obtain:

Y_xyc = exp( -((x - x̃)² + (y - ỹ)²) / (2σ²) )

where Y_xyc represents the confidence of class c at pixel coordinates (x, y), the value of Y_xyc lies between 0 and 1, and σ is a standard deviation obtained adaptively from the target size; the confidence values outside the pixel circle are all set to 0.

Finally, the network is trained using the data after Gaussian smoothing.
And step 3: the test picture is input into the trained lightweight YOLOV3 network for target feature extraction. For each class of target, the network outputs the predicted center point coordinates (x̂, ŷ), the predicted offset of the target center point (δx, δy), and the predicted target size, and the coordinates of the upper-left and lower-right corners of the target box are decoded according to the following formulas:

x1 = x̂ + δx - ŵ/2,  y1 = ŷ + δy - ĥ/2
x2 = x̂ + δx + ŵ/2,  y2 = ŷ + δy + ĥ/2

where ŵ and ĥ respectively represent the predicted width and height of the target.
2. The method as claimed in claim 1, wherein in step 2, if two adjacent targets exist in the same picture, the Gaussian smoothing is performed separately with each target as the center, and the confidence of each pixel in the overlapping portion of the two pixel circles takes the larger of the two values.
3. The method for extracting image targets based on lightweight YOLOV3 as claimed in claim 1, wherein the radius r is determined by requiring that a box with the width w and height h of the annotated target box, when shifted by r, still has an intersection-over-union of at least overlap with the target box, r being obtained by solving the resulting quadratic equation in r; w and h are the width and height of the target box containing the annotated target, and overlap is the set threshold representing the intersection-over-union between the shifted box and the target box.
4. The method for extracting image targets based on lightweight YOLOV3 as claimed in any one of claims 1 to 3, wherein in step 2, the loss function used in training the network is as follows:

L = L_k + λ1 · L_off + λ2 · L_size

wherein λ1 and λ2 are coefficients that adjust the loss function and L is the loss function value;

L_k is the target center point loss:

L_k = -(1/N) · Σ_xyc { (1 - Ŷ_xyc)^α · log(Ŷ_xyc),                  if Y_xyc = 1
                       (1 - Y_xyc)^β · (Ŷ_xyc)^α · log(1 - Ŷ_xyc),  otherwise }

wherein N indicates the number of targets in the picture, the sum runs over all coordinate points (x, y) of the channel in which class c is located, Ŷ_xyc represents the confidence of class c predicted at coordinates (x, y), and α and β represent hyperparameters;

L_off is the center point offset loss:

L_off = (1/N) · Σ_p | Ô_p̃ - (p/R - p̃) |

wherein p denotes the target center point coordinates (x, y), Ô_p̃ denotes the predicted target center point offset, and p̃ denotes the target center point coordinates (x̃, ỹ) in the feature map;

L_size is the target size loss:

L_size = (1/N) · Σ_{k=1..N} | Ŝ_k - s_k |

wherein Ŝ_k is the predicted target size and s_k = (W_k, H_k) is the annotated size of the k-th target.
CN202111496943.4A 2021-12-09 2021-12-09 Image target extraction method based on lightweight YOLOV3 Active CN113902044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111496943.4A CN113902044B (en) 2021-12-09 2021-12-09 Image target extraction method based on lightweight YOLOV3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111496943.4A CN113902044B (en) 2021-12-09 2021-12-09 Image target extraction method based on lightweight YOLOV3

Publications (2)

Publication Number Publication Date
CN113902044A (en) 2022-01-07
CN113902044B (en) 2022-03-01

Family

ID=79025453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111496943.4A Active CN113902044B (en) 2021-12-09 2021-12-09 Image target extraction method based on lightweight YOLOV3

Country Status (1)

Country Link
CN (1) CN113902044B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881830A (en) * 2023-07-26 2023-10-13 中国信息通信研究院 Self-adaptive detection method and system based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325504A (en) * 2018-09-07 2019-02-12 中国农业大学 A kind of underwater sea cucumber recognition methods and system
CN110796186A (en) * 2019-10-22 2020-02-14 华中科技大学无锡研究院 Dry and wet garbage identification and classification method based on improved YOLOv3 network
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN111222474A (en) * 2020-01-09 2020-06-02 电子科技大学 Method for detecting small target of high-resolution image with any scale
CN112101434A (en) * 2020-09-04 2020-12-18 河南大学 Infrared image weak and small target detection method based on improved YOLO v3
CN112581386A (en) * 2020-12-02 2021-03-30 南京理工大学 Full-automatic lightning arrester detection and tracking method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
白士磊: "Traffic sign detection algorithm based on lightweight YOLOv3", Computer and Modernization (《计算机与现代化》) *
齐榕: "Lightweight object detection network based on YOLOv3", Computer Applications and Software (《计算机应用与软件》) *


Also Published As

Publication number Publication date
CN113902044B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN109543606B (en) Human face recognition method with attention mechanism
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN108182456B (en) Target detection model based on deep learning and training method thereof
WO2023015743A1 (en) Lesion detection model training method, and method for recognizing lesion in image
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN111445488B (en) Method for automatically identifying and dividing salt body by weak supervision learning
CN106599789A (en) Video class identification method and device, data processing device and electronic device
CN107145867A (en) Face and face occluder detection method based on multitask deep learning
CN108182454A (en) Safety check identifying system and its control method
CN111079739B (en) Multi-scale attention feature detection method
CN108197569A (en) Obstacle recognition method, device, computer storage media and electronic equipment
CN115661943B (en) Fall detection method based on lightweight attitude assessment network
CN113420643B (en) Lightweight underwater target detection method based on depth separable cavity convolution
CN111832484A (en) Loop detection method based on convolution perception hash algorithm
KR20170038622A (en) Device and method to segment object from image
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN107292229A (en) A kind of image-recognizing method and device
CN110136162B (en) Unmanned aerial vehicle visual angle remote sensing target tracking method and device
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN112633149A (en) Domain-adaptive foggy-day image target detection method and device
CN106780546A (en) The personal identification method of the motion blur encoded point based on convolutional neural networks
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220316

Address after: No. 88, Wenchang East Road, Yangzhou, Jiangsu 225000

Patentee after: Jiangsu Daoyuan Technology Group Co.,Ltd.

Address before: 211135 enlightenment star Nanjing maker space G41, second floor, No. 188, Qidi street, Qilin science and Technology Innovation Park, Qixia District, Nanjing, Jiangsu Province

Patentee before: Jiangsu Peregrine Microelectronics Co.,Ltd.