CN109214399B - Improved YOLOV3 target identification method embedded in SENET structure

Improved YOLOV3 target identification method embedded in SENET structure

Info

Publication number
CN109214399B
CN109214399B (application CN201811187750.9A)
Authority
CN
China
Prior art keywords
yolov3
senet
recognition method
training
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811187750.9A
Other languages
Chinese (zh)
Other versions
CN109214399A (en)
Inventor
刘学平
李玙乾
刘励
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University filed Critical Shenzhen Graduate School Tsinghua University
Priority to CN201811187750.9A priority Critical patent/CN109214399B/en
Publication of CN109214399A publication Critical patent/CN109214399A/en
Application granted granted Critical
Publication of CN109214399B publication Critical patent/CN109214399B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of deep learning, in particular to an improved YOLOV3 target recognition method embedded with a SENET structure, which comprises the following steps. Step S100: collecting characteristic information of the object to be identified and making a data set, the characteristic information including image information; step S300: taking one part of the data set as a training set and the other part as a test set; step S500: embedding an SE structure in the YOLOV3 algorithm to obtain the SE-YOLOV3 algorithm; step S600: training SE-YOLOV3 on the training set; step S700: testing the performance of SE-YOLOV3 on the test set. With the improved YOLOV3 target identification method embedded with the SENET structure, even when a sample picture contains considerable interference from defective parts, the method can still accurately identify the target parts and achieves higher precision and recall.

Description

Improved YOLOV3 target identification method embedded in SENET structure
Technical Field
The invention relates to the field of deep learning, in particular to an improved YOLOV3 target recognition method embedded with a SENET structure.
Background
Deep learning originates from the study of artificial neural networks and aims to simulate the way the human brain perceives and distinguishes things. It combines low-level features into more abstract high-level representations, thereby discovering distributed feature representations of the data. The convolutional neural network (CNN) is a deep learning method with excellent performance in image processing.
SENET is a convolutional neural network structure proposed by Jie Hu and his team in 2017, which won the ILSVRC 2017 classification task. The structure consists of three operations, Squeeze, Excitation and Scale (reweighting), and explicitly models the interdependencies between feature channels. Since it neither changes the size of the feature map nor introduces new functions, the SE block has the great advantage that it can be embedded in almost any network architecture. Experimental results show that embedding the SE structure in classical networks such as ResNet and Inception remarkably improves the generalization ability and accuracy of the model.
In the past, image recognition mainly relied on traditional image processing techniques, in which features had to be extracted manually and fed into a classifier for recognition. With the continuous development of deep learning, various target detection algorithms have been proposed that gradually replace the traditional detection methods. At present, image recognition with convolutional neural networks mainly falls into two types: one is the region-based target identification methods, such as R-CNN, Faster R-CNN and Mask R-CNN, which locate targets accurately but detect slowly; the other is the regression-based target recognition methods, such as YOLO.
The YOLO algorithm, whose full name is You Only Look Once, was proposed by Joseph Redmon et al. in 2015. YOLOV3, proposed in 2018, consolidates the YOLO series of algorithms; it is both fast and accurate and can meet industrial real-time requirements.
At present, convolutional neural networks are often used to solve image recognition problems. As shown in fig. 6, there are 4 types of target parts to be sorted; the poses of the parts are arbitrary but do not overlap. Such a problem is difficult for traditional image processing algorithms that rely on manual feature extraction, but can currently be handled with the better-performing YOLOV3. Because the part positions are disordered, many defective (truncated) parts appear at the edges of the captured pictures (edge positions in fig. 6). When the YOLOV3 network is trained on such data, the defective parts are not identified correctly and background regions are mistaken for parts, so the precision on the training set is low (only 72.3%). The YOLOV3 algorithm therefore needs to be improved to solve these problems.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides an improved YOLOV3 target identification method embedded with a SENET structure; even when a sample picture contains considerable interference from defective parts, the method can still accurately identify the target parts and obtain higher precision and recall.
The technical scheme for solving the problems is as follows: an improved YOLOV3 target recognition method embedded with a SENET structure, characterized by comprising the following steps:
step S100: collecting characteristic information of an object to be identified, and making a data set; the feature information includes image information;
step S300: taking one part of the data set as a training set, and taking the other part of the data set as a test set;
step S500: embedding an SE structure in a YOLOV3 algorithm to obtain an SE-YOLOV3 algorithm;
step S600: training SE-YOLOV3 on a training set;
step S700: testing the performance of SE-YOLOV3 on the test set.
Further, the method also comprises step S200: performing data processing and labeling on the collected characteristic parameters of the object to be identified; wherein step S200 lies between step S100 and step S300.
Further, in step S200, the data processing includes scaling up and/or scaling down and/or brightness enhancement and/or brightness reduction and/or flipping and/or adding noise to the acquired characteristic parameters of the object to be recognized.
Further, in step S200, the labeling is specifically: annotating the acquired characteristic parameters of the object to be identified with an image annotation tool.
Further, the method further comprises step S400: recalculating the anchors for the data set obtained in step S200, specifically: reading the labeled data set, randomly taking out W and H values as initial seed points, iterating with the K-means algorithm, and computing the anchors; wherein step S400 lies between step S200 and step S300.
Further, in step S100, the characteristic information of the object to be recognized is collected with a CMOS camera.
Further, in step S500, an SE structure is embedded in the YOLOV3 algorithm to obtain the SE-YOLOV3 algorithm, specifically: an SE structure is embedded after each shortcut layer; the configuration file cfg of YOLOV3 is modified by adding an SE substructure after the 4th, 8th, 11th, 15th, 18th, 21st, 24th, 27th, 30th, 33rd, 36th, 40th, 43rd, 46th, 49th, 52nd, 55th, 58th, 61st, 65th, 68th, 71st and 74th shortcut layers, and the channel value of the global average pooling is designated as the number of feature-map channels output by the shortcut layer, for computing the global average pooling;
specifically, the channel values are, in sequence, 64, 128, 128, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024.
Further, in step S600, SE-YOLOV3 is trained on the training set, and a weight file is obtained after the training is finished.
Further, in step S700, the performance of SE-YOLOV3 is tested on the test set, specifically: loading the weight data obtained in step S600 and performing a performance test of SE-YOLOV3 on the test set.
The invention has the advantages that:
With the SENET-structure-embedded improved YOLOV3 target recognition method, the output feature-map weights have a global receptive field and integrate the information of every channel, which enhances the expressive power of the feature maps and improves the detection performance of the model in edge regions. Even when the edges of a sample picture contain considerable interference from defective parts, the targets can still be accurately recognized, false recognition in edge regions is reduced, and the number of false positives decreases, so a higher precision is obtained.
Drawings
FIG. 1 is a schematic flow diagram of an embodiment of the present invention;
FIG. 2 is a diagram of a YOLOV3 network architecture;
FIG. 3 is a diagram of the SE-YOLOV3 network architecture of the present invention;
FIG. 4 is a YOLOV3 training data curve;
FIG. 5 is a plot of the SE-YOLOV3 training data of the present invention;
FIG. 6 is a graph of the detection results of YOLOV3;
FIG. 7 is a graph of the detection results of SE-YOLOV3 according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments will now be described clearly and completely with reference to the accompanying drawings; obviously, the described embodiments are some, but not all, embodiments of the present invention. Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but merely represents selected embodiments of the invention. All other embodiments obtained by a person skilled in the art without inventive effort on the basis of these embodiments fall within the scope of the present invention.
Referring to fig. 1, the object of the present invention is to provide an improved YOLOV3 target recognition method embedded with a SENET structure, comprising the following steps:
step S100: collecting characteristic information of an object to be identified, and making a data set; the feature information includes image information.
In this embodiment, specifically, an annular red light source is fixed below the CMOS camera, and pictures of 4 types of small fasteners are collected, giving RGB three-channel pictures as the convolutional neural network data set.
This example uses a Basler acA2500-20gc camera to photograph the 4 types of small fasteners (size 4-7 mm), with the annular red light source fixed under the camera. 529 pictures are obtained at a resolution of 2592 × 2048 pixels (RGB, three channels); a program is written to convert the pictures to a single channel to increase computation speed.
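A minimal sketch of this single-channel conversion, assuming OpenCV in Python; the folder layout and file pattern are illustrative assumptions, not the program actually used:

    # Convert the captured RGB (BGR in OpenCV) pictures to single-channel grayscale.
    import glob
    import cv2

    for path in glob.glob("dataset/raw/*.png"):          # hypothetical source folder
        img = cv2.imread(path)                           # 2592 x 2048, 3 channels
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # collapse to one channel
        cv2.imwrite(path.replace("raw", "gray"), gray)   # hypothetical output folder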
Step S200: and carrying out data processing and labeling on the acquired characteristic parameters of the object to be identified. The data processing comprises the steps of amplifying and/or reducing and/or enhancing the brightness and/or weakening the brightness and/or turning over and/or increasing noise of the acquired characteristic parameters of the object to be identified; the labels are specifically: and marking the acquired characteristic parameters of the object to be identified by using an image marking tool.
The method specifically comprises: performing data enhancement on the acquired original pictures, and labeling the 4 types of parts in the pictures with the image annotation tool LabelImg.
This embodiment turns the acquired images into a data set usable by a convolutional neural network. A program is written to augment the original pictures with 6 methods: enlarging (scaling the image width and height to 1.5 times), shrinking (scaling the width to 1/3 and the height to 1/2 so that the image size remains a multiple of 32), brightness enhancement, brightness reduction, flipping (90 and 180 degrees) and adding noise (speckle noise), thereby obtaining 4232 pictures. The 4 types of parts in the 529 original pictures are labeled with the image annotation tool LabelImg. The specific method is: draw a box around every complete part in a picture (ignoring the defective parts at the picture edges) to obtain VOC-format xml files. A program is also written to compute the label files of the 3703 data-augmented pictures from the label files of the 529 original pictures. A format conversion program then converts the xml files into txt files in "label" + "X" + "Y" + "W" + "H" format.
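The six augmentation operations can be sketched in Python with OpenCV as below; the scale factors and rotation angles follow the text, while the brightness offsets, noise variance and function names are assumptions, and the matching transformation of the label files is omitted:

    import cv2
    import numpy as np

    def augment(img):
        """Apply the 6 augmentation methods from the text to one picture."""
        h, w = img.shape[:2]
        out = {}
        out["enlarge"] = cv2.resize(img, (int(w * 1.5), int(h * 1.5)))  # 1.5x width and height
        out["shrink"] = cv2.resize(img, (w // 3, h // 2))               # stays a multiple of 32
        out["brighten"] = cv2.convertScaleAbs(img, alpha=1.0, beta=40)  # offset is an assumption
        out["darken"] = cv2.convertScaleAbs(img, alpha=1.0, beta=-40)   # offset is an assumption
        out["rot90"] = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)         # 90-degree turn
        out["rot180"] = cv2.rotate(img, cv2.ROTATE_180)                 # 180-degree turn
        speckle = np.random.randn(*img.shape) * 0.1                     # variance is an assumption
        out["noise"] = np.clip(img + img * speckle, 0, 255).astype(np.uint8)
        return out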
Step S300: and taking one part of the data set as a training set, and taking the rest part of the data set as a testing set.
This embodiment specifically includes: 3400 of the 4232 pictures obtained are randomly selected as the training set, containing 40612 parts; the remaining 832 pictures are used as the test set, containing 9963 parts.
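A toy sketch of this random split, assuming the picture files are listed from disk; the seed and path are illustrative only:

    import glob
    import random

    pictures = sorted(glob.glob("dataset/images/*.png"))  # the 4232 labelled pictures (assumed path)
    random.seed(0)                                        # fixed seed for a reproducible split
    train_set = random.sample(pictures, 3400)             # training set
    test_set = sorted(set(pictures) - set(train_set))     # the remaining 832 form the test set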
Step 4: building the YOLOV3 model and recalculating some parameters.
In this embodiment, a development platform is first set up: the CPU is an Intel(R) Core(TM) i7-8700, the GPU an NVIDIA GeForce GTX 1080Ti, the operating system Ubuntu 16.04 LTS, and the deep learning framework PyTorch. A GPU-accelerated YOLOV3 model is then built, as shown in fig. 2. To adapt the model to the data set, some parameters must be modified, specifically: the number of channels in the YOLOV3 cfg configuration file is set to 1 so that single-channel grayscale pictures are read in; the number of categories in the cfg configuration file is 4, and the number of channels of the three-scale feature maps used for detection (13 × 13, 26 × 26, 52 × 52) is set to (4+1+4) × 3, namely 27; the labels of the 4 types of parts are set to type1, type2, type3 and type4. The anchors in the YOLOV3 source code were designed for the COCO data set, so for the data set here they are recomputed with K-means clustering. The specific method is: read the W, H data in the 4232 labeled txt files, randomly take out 9 groups of W, H values as initial seed points, set the number of iterations to 10000, and iterate with the K-means algorithm. The final computation yields anchor boxes suited to this data set: (39, 56), (47, 58), (51, 66), (42, 82), (63, 57), (69, 53), (49, 81), (65, 67), (61, 80).
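The anchor recomputation can be sketched as below: 9 random (W, H) seed points and at most 10000 K-means iterations, as in the text. Plain Euclidean distance is used for brevity, although anchor clustering is often done with an IoU-based distance; treat these details as assumptions.

    import numpy as np

    def kmeans_anchors(wh, k=9, iters=10000, seed=0):
        """wh: (N, 2) array of W, H values read from the labelled txt files."""
        rng = np.random.default_rng(seed)
        centers = wh[rng.choice(len(wh), size=k, replace=False)]  # random initial seed points
        for _ in range(iters):
            dist = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
            labels = dist.argmin(axis=1)                          # nearest center per box
            new = np.array([wh[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])   # keep empty clusters in place
            if np.allclose(new, centers):                         # converged early
                break
            centers = new
        # sort by area, matching the small-to-large anchor convention of YOLOV3
        return sorted(map(tuple, centers.round().astype(int)), key=lambda c: c[0] * c[1])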
Step 5: embedding an SE structure in the YOLOV3 algorithm to obtain the improved SE-YOLOV3 algorithm.
This example improves the YOLOV3 model established in Step 4. In fig. 3, CBR1 denotes a convolution layer with a 1 × 1 convolution kernel, followed by batch normalization with ReLU as the activation function. CBR3 is the same as CBR1 except that the convolution kernel size is 3 × 3. A shortcut layer is composed of CBR1 and CBR3, and shortcut n in the figure denotes that n (CBR1 + CBR3) substructures are used. The SE-YOLOV3 algorithm is shown in fig. 3: the SE structure performs global average pooling on the input feature map to obtain a feature map of size C × 1 × 1 (C is the number of feature-map channels); after two fully connected layers (dimensionality reduction, then dimensionality increase) it is activated by a sigmoid function, yielding a weight of size C × 1 × 1, which is multiplied position-wise with the original input feature map to produce the output. An SE-shortcut n layer is composed of a conventional shortcut n layer and an SE structure: an SE structure is embedded after each shortcut, and SE-shortcut n denotes that n (CBR1 + CBR3 + SE) substructures are used. The configuration file cfg of YOLOV3 is modified by adding an SE substructure after the 4th, 8th, 11th, 15th, 18th, 21st, 24th, 27th, 30th, 33rd, 36th, 40th, 43rd, 46th, 49th, 52nd, 55th, 58th, 61st, 65th, 68th, 71st and 74th layers (all shortcut layers) and designating the global-average-pooling channel value as the number of feature-map channels output by the shortcut layer, for computing the global average pooling. Specifically, the channel values are, in sequence, 64, 128, 128, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024. Taking the 4th layer (a shortcut layer) of YOLOV3 as an example: the SE structure's input feature map is 208 × 208 × 64; global average pooling yields a 1 × 1 × 64 feature map, which becomes 1 × 1 × 4 after the first fully connected layer and 1 × 1 × 64 after the second; a 1 × 1 × 64 weight is then obtained through the sigmoid activation function, and multiplying the input feature map by this weight gives the 208 × 208 × 64 output. The network depth of YOLOV3 is 106 layers; the improved SE-YOLOV3 reaches 129 layers.
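The SE substructure just described can be sketched in PyTorch (the framework used here); the ReLU between the two fully connected layers follows the original SENet design and, like the reduction ratio of 16 (which matches the 64 to 4 to 64 example in the text), is an assumption rather than something the patent states:

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Squeeze (global average pooling), two FC layers, sigmoid, channel reweighting."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)               # C x H x W -> C x 1 x 1
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),   # dimensionality reduction
                nn.ReLU(inplace=True),                        # assumed, per the SENet paper
                nn.Linear(channels // reduction, channels),   # dimensionality increase
                nn.Sigmoid(),                                 # weights in (0, 1)
            )

        def forward(self, x):
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                      # reweight the input channels

    # The 4th-layer example from the text: a 208 x 208 x 64 feature map in, same size out.
    se = SEBlock(64)                                          # 64 -> 4 -> 64
    print(se(torch.randn(1, 64, 208, 208)).shape)             # torch.Size([1, 64, 208, 208])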
Step 6: training the YOLOV3 model and the SE-YOLOV3 model on the training set to obtain weight data.
This embodiment trains the models. The YOLOV3 and SE-YOLOV3 models built in Steps 4 and 5 are trained from scratch on the training set for 150 epochs, with the following parameter settings: the batch size is 12, gradient optimization uses the Adam algorithm, the initial learning rate is 0.0001, the weight decay is set to 0.0005, and the learning rate is reduced to 1/10 every 50 epochs. A true positive (TP) is determined as follows: in a grid cell where a part exists, the confidence is greater than 0.99, the intersection over union (IoU) of the predicted box (bbox) and the ground-truth box is greater than 0.5, and the predicted category is correct; a prediction counts as a true positive only when all 3 conditions are met. A false positive (FP) covers two cases: first, in a grid cell where a part exists, the confidence is greater than 0.99 but the prediction is not a true positive; second, in a grid cell where no part exists, the confidence is greater than 0.99. A false negative (FN) is determined as follows: in a grid cell where a part exists, the confidence is less than 0.99. The precision is P = TP/(TP + FP) and the recall is R = TP/(TP + FN). The data after each epoch are saved and plotted as curves, as shown in fig. 4 and 5. The weight data are stored after training finishes.
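The IoU test and the precision/recall formulas above can be sketched as follows; the thresholds and formulas come from the text, while the box representation and helper structure are assumptions:

    def iou(box_a, box_b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def is_true_positive(confidence, pred_box, gt_box, class_correct):
        """All 3 conditions from the text, in a grid cell that contains a part."""
        return confidence > 0.99 and iou(pred_box, gt_box) > 0.5 and class_correct

    def precision_recall(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0   # P = TP / (TP + FP)
        r = tp / (tp + fn) if tp + fn else 0.0   # R = TP / (TP + FN)
        return p, r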
Step 7: testing the performance of the YOLOV3 and SE-YOLOV3 models on the test set.
This embodiment tests the models. The two sets of weight data obtained in Step 6 are loaded into the YOLOV3 and SE-YOLOV3 models, which are then verified separately on the test set with the following parameter settings: the batch size is 32, the IoU threshold is set to 0.5, the confidence threshold to 0.98, and the non-maximum suppression threshold to 0.4. Finally, the precision and recall of the test are obtained and the test pictures are saved, as shown in fig. 6 and 7.
The invention has the following beneficial effects. As can be seen from fig. 4 and 5, on the training set the precision of YOLOV3 is 72.3% and its recall 98.5%, while the precision of SE-YOLOV3 is 92.5% and its recall 96.1%. On the test set, the precision of YOLOV3 is 71.6% and its recall 97.7%, while the precision of SE-YOLOV3 is 91.9% and its recall 95.7%. As can be seen from fig. 6 and 7, YOLOV3 misidentifies defective edge parts and also misidentifies background as parts, whereas SE-YOLOV3 identifies only the complete parts and never labels background as a part. By adding the SE structure on the basis of YOLOV3, the invention combines the advantages of YOLOV3 and the SE model, strengthens the robustness of the YOLOV3 algorithm against defective parts in images, reduces the number of false positives, and raises the precision by 20.3 percentage points while maintaining a high recall.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent structures or equivalent process transformations made by using the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present invention.

Claims (8)

1. An improved YOLOV3 target recognition method embedded with a SENET structure, characterized by comprising the following steps:
step S100: collecting characteristic information of an object to be identified, and making a data set; the feature information includes image information;
step S300: taking one part of the data set as a training set, and taking the other part of the data set as a test set;
step S500: embedding an SE structure in the YOLOV3 algorithm to obtain the SE-YOLOV3 algorithm, which specifically comprises: embedding an SE structure after each shortcut layer; modifying the configuration file cfg of YOLOV3, adding an SE substructure after the 4th, 8th, 11th, 15th, 18th, 21st, 24th, 27th, 30th, 33rd, 36th, 40th, 43rd, 46th, 49th, 52nd, 55th, 58th, 61st, 65th, 68th, 71st and 74th shortcut layers, and designating the channel value of the global average pooling as the number of feature-map channels output by the shortcut layer, for computing the global average pooling;
specifically, the channel values are, in sequence, 64, 128, 128, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024;
step S600: training SE-YOLOV3 on a training set;
step S700: testing the performance of SE-YOLOV3 on the test set.
2. The SENET-structure-embedded improved YOLOV3 target recognition method according to claim 1, wherein the method further comprises step S200: performing data processing and labeling on the collected characteristic parameters of the object to be identified; wherein step S200 lies between step S100 and step S300.
3. The SENET-structure-embedded improved YOLOV3 target recognition method according to claim 2, wherein in step S200 the data processing includes scaling up and/or scaling down and/or brightness enhancement and/or brightness reduction and/or flipping and/or adding noise to the acquired characteristic parameters of the object to be identified.
4. The SENET-structure-embedded improved YOLOV3 target recognition method according to claim 3, wherein in step S200 the labeling is specifically: annotating the acquired characteristic parameters of the object to be identified with an image annotation tool.
5. The SENET-structure-embedded improved YOLOV3 target recognition method according to claim 4, wherein the method further comprises step S400: recalculating the anchors for the data set obtained in step S200, specifically: reading the labeled data set, randomly taking out W and H values as initial seed points, iterating with the K-means algorithm, and computing the anchors; wherein step S400 lies between step S200 and step S300.
6. The SENET-structure-embedded improved YOLOV3 target recognition method according to any one of claims 1 to 5, wherein in step S100 the characteristic information of the object to be recognized is collected with a CMOS camera.
7. The SENET-structure-embedded improved YOLOV3 target recognition method according to any one of claims 1 to 5, wherein in step S600 SE-YOLOV3 is trained on the training set and a weight file is obtained after training is completed.
8. The SENET-structure-embedded improved YOLOV3 target recognition method according to claim 7, wherein in step S700 the performance of SE-YOLOV3 is tested on the test set, specifically: loading the weight data obtained in step S600 and performing a performance test of SE-YOLOV3 on the test set.
CN201811187750.9A 2018-10-12 2018-10-12 Improved YOLOV3 target identification method embedded in SENET structure Active CN109214399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811187750.9A CN109214399B (en) 2018-10-12 2018-10-12 Improved YOLOV3 target identification method embedded in SENET structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811187750.9A CN109214399B (en) 2018-10-12 2018-10-12 Improved YOLOV3 target identification method embedded in SENET structure

Publications (2)

Publication Number Publication Date
CN109214399A CN109214399A (en) 2019-01-15
CN109214399B 2021-01-01

Family

ID=64979487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811187750.9A Active CN109214399B (en) 2018-10-12 2018-10-12 Improved YOLOV3 target identification method embedded in SENET structure

Country Status (1)

Country Link
CN (1) CN109214399B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886925A (en) * 2019-01-19 2019-06-14 天津大学 A kind of aluminium material surface defect inspection method that Active Learning is combined with deep learning
CN109978868A (en) * 2019-03-29 2019-07-05 北京百度网讯科技有限公司 Toy appearance quality determining method and its relevant device
CN110011727B (en) * 2019-04-09 2020-03-20 浩鲸云计算科技股份有限公司 ODF equipment port-oriented detection system
CN111832336B (en) * 2019-04-16 2022-09-02 四川大学 Improved C3D video behavior detection method
CN110084166B (en) * 2019-04-19 2020-04-10 山东大学 Intelligent substation smoke and fire identification and monitoring method based on deep learning
CN110210571B (en) * 2019-06-10 2023-01-06 腾讯医疗健康(深圳)有限公司 Image recognition method and device, computer equipment and computer readable storage medium
CN110929577A (en) * 2019-10-23 2020-03-27 桂林电子科技大学 Improved target identification method based on YOLOv3 lightweight framework
CN110781888B (en) * 2019-10-25 2022-07-12 北京字节跳动网络技术有限公司 Method and device for returning to screen in video picture, readable medium and electronic equipment
CN110909666B (en) * 2019-11-20 2022-10-25 西安交通大学 Night vehicle detection method based on improved YOLOv3 convolutional neural network
CN111428550A (en) * 2019-11-29 2020-07-17 长沙理工大学 Vehicle detection method based on improved YOLOv3
CN111382678B (en) * 2020-02-25 2023-04-18 浙江大学 Tourist bus passenger flow statistical algorithm based on improved CNN network
CN111429418A (en) * 2020-03-19 2020-07-17 天津理工大学 Industrial part detection method based on YOLOv3 neural network
CN111476131B (en) * 2020-03-30 2021-06-11 北京微播易科技股份有限公司 Video processing method and device
CN111814862A (en) * 2020-06-30 2020-10-23 平安国际智慧城市科技股份有限公司 Fruit and vegetable identification method and device
CN112070713A (en) * 2020-07-03 2020-12-11 中山大学 Multi-scale target detection method introducing attention mechanism
CN111882554B (en) * 2020-08-06 2022-05-03 桂林电子科技大学 SK-YOLOv 3-based intelligent power line fault detection method
TWI768432B (en) * 2020-08-18 2022-06-21 新加坡商鴻運科股份有限公司 Method,system,electronic device and storage medium for classifying and processing parts before assembly
CN112163602A (en) * 2020-09-14 2021-01-01 湖北工业大学 Target detection method based on deep neural network
CN112561872B (en) * 2020-12-09 2022-05-24 福州大学 Corrosion defect segmentation method for tower crane
CN112749734B (en) * 2020-12-29 2024-01-05 北京环境特性研究所 Domain-adaptive target detection method based on movable attention mechanism
CN112884090A (en) * 2021-04-14 2021-06-01 安徽理工大学 Fire detection and identification method based on improved YOLOv3
CN113553936A (en) * 2021-07-19 2021-10-26 河北工程大学 Mask wearing detection method based on improved YOLOv3

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358596B (en) * 2017-04-11 2020-09-18 阿里巴巴集团控股有限公司 Vehicle loss assessment method and device based on image, electronic equipment and system
CN108154120A (en) * 2017-12-25 2018-06-12 上海七牛信息技术有限公司 video classification model training method, device, storage medium and electronic equipment
CN108304820B (en) * 2018-02-12 2020-10-13 腾讯科技(深圳)有限公司 Face detection method and device and terminal equipment
CN108573303A (en) * 2018-04-25 2018-09-25 北京航空航天大学 It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN108510012B (en) * 2018-05-04 2022-04-01 四川大学 Target rapid detection method based on multi-scale feature map

Also Published As

Publication number Publication date
CN109214399A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
CN109214399B (en) Improved YOLOV3 target identification method embedded in SENET structure
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN111444821B (en) Automatic identification method for urban road signs
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN104361313B (en) A kind of gesture identification method merged based on Multiple Kernel Learning heterogeneous characteristic
CN103984948B (en) A kind of soft double-deck age estimation method based on facial image fusion feature
CN105574550A (en) Vehicle identification method and device
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN113919442B (en) Tobacco maturity state identification method based on convolutional neural network
CN114998220B (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN111178451A (en) License plate detection method based on YOLOv3 network
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111340096A (en) Weakly supervised butterfly target detection method based on confrontation complementary learning
CN110543906A (en) Skin type automatic identification method based on data enhancement and Mask R-CNN model
WO2020119624A1 (en) Class-sensitive edge detection method based on deep learning
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN113177528B (en) License plate recognition method and system based on multi-task learning strategy training network model
CN103268494B (en) Parasite egg recognition methods based on rarefaction representation
CN113378642B (en) Method for detecting illegal occupation buildings in rural areas
CN117036948A (en) Sensitized plant identification method based on attention mechanism
CN117011614A (en) Wild ginseng reed body detection and quality grade classification method and system based on deep learning
CN116189130A (en) Lane line segmentation method and device based on image annotation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant