CN113887395A - Depth separable convolution YOLOv4 model-based filter bag opening position detection method - Google Patents

Depth separable convolution YOLOv4 model-based filter bag opening position detection method

Info

Publication number
CN113887395A
Authority
CN
China
Prior art keywords
filter bag
separable convolution
yolov4
target detection
bag opening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111152052.7A
Other languages
Chinese (zh)
Inventor
王宪保
余皓鑫
周宝
陈科宇
雷雅彧
翁扬凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202111152052.7A priority Critical patent/CN113887395A/en
Publication of CN113887395A publication Critical patent/CN113887395A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/23 Clustering techniques
                • G06F 18/232 Non-hierarchical techniques
                  • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
                    • G06F 18/23213 Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a filter bag opening position detection method based on a depth separable convolution YOLOv4 model. First, pictures of the filter bag opening on a conveyor belt are collected by a camera; then data enhancement is performed on all pictures, the pictures are labeled, and a filter bag opening data set is constructed. Next, a YOLOv4 target detection model based on depth separable convolution is constructed, training parameters are set, and the model is trained with the filter bag opening data set. Finally, a picture of a filter bag opening to be detected is input into the trained model, which outputs the picture with the detection box position and the filter bag opening type marked. The method reduces the total number of model parameters and improves the computation speed of the model.

Description

Depth separable convolution YOLOv4 model-based filter bag opening position detection method
Technical Field
The invention relates to the field of computer vision, in particular to image detection and recognition, and specifically to a filter bag opening position detection method based on a depth separable convolution YOLOv4 model.
Background
In recent years, China's economy has developed rapidly, living standards have improved, and industrial requirements have changed greatly compared with the past; while production efficiency must be guaranteed, many industrial manufacturers suffer from inaccurate positioning of product positions. The bag opening of the filter bag occupies an important position in the intelligent production of filter bags, but because the filter bag is a flexible object, traditional target detection methods are difficult to apply and their detection accuracy cannot meet factory production requirements. An effective method for automatically locating the accurate position of the filter bag opening is therefore needed to improve the automation degree and production efficiency of factory assembly lines.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a filter bag opening position detection method based on a depth separable convolution YOLOv4 model, which has the following specific technical scheme:
a filter bag opening position detection method based on a depth separable convolution YOLOv4 model specifically comprises the following steps:
s1: collecting pictures of a filter bag opening on a conveyor belt through a camera;
s2: performing data enhancement on all the pictures obtained in the step S1, labeling the pictures, and constructing a filter bag opening data set;
s3: constructing a YOLOv4 target detection model based on the depth separable convolution, setting training parameters, and training the YOLOv4 target detection model based on the depth separable convolution by adopting the filter bag opening data set constructed in the S2;
s4: inputting the picture of the filter bag opening to be detected into a trained YOLOv4 target detection model based on depth separable convolution, and outputting the picture of the filter bag opening to be detected with the position of a detection frame and the type of the filter bag opening marked.
Further, the data enhancement in S2 is specifically: sharpening the pictures of the filter bag opening, then randomly rotating the images, changing their brightness, and mirror-flipping them to expand the image set, and then labeling all images to obtain the filter bag opening data set.
Further, the S3 specifically includes the following sub-steps:
S3.1: constructing a YOLOv4 target detection model based on depth separable convolution, wherein the model comprises an input layer, a feature extraction network, a bottleneck structure and a prediction layer; the feature extraction network adopts the MobileNetV3 neural network, and the depth separable convolution from MobileNetV3 is adopted as the convolution in the bottleneck structure of YOLOv4;
S3.2: before detecting filter bag openings, the YOLOv4 target detection model based on depth separable convolution clusters the prior boxes with the K-means++ clustering algorithm to obtain 9 prior boxes of suitable sizes;
S3.3: inputting the filter bag opening data set into the YOLOv4 target detection model based on depth separable convolution, where the feature extraction network extracts picture features and outputs feature maps of different scales;
S3.4: inputting the feature maps of different sizes into the bottleneck structure, where features are extracted repeatedly to finally obtain three feature maps of different sizes;
S3.5: predicting on the three feature maps of different sizes according to the initial prior box sizes obtained in S3.2, performing non-maximum suppression on the obtained prediction results, screening out the prediction box with the highest intersection-over-union with the real box, sending it into the prediction layer to adjust the prediction box, and updating the parameters of the target detection model through back propagation.
Further, the S3.2 is specifically realized by the following sub-steps:
S3.2.1: firstly, randomly selecting N real frames from the filter bag opening data set as initial prior frames, where N equals 9;
S3.2.2: calculating the distance between each real frame and each prior frame according to the formula $d_{mn} = 1 - IoU(box_m, anchor_n)$; when $d_{mn}$ is minimal, the corresponding m-th real frame belongs to the n-th cluster, where $box_m$ is the m-th real frame, $m \in M$, and M is the total number of real frames; $anchor_n$ is the n-th prior frame, $n \in N$; IoU is the ratio of the intersection of the two frames to their union, i.e.
$$IoU(box_m, anchor_n) = \frac{|box_m \cap anchor_n|}{|box_m \cup anchor_n|}$$
S3.2.3: after clustering all real frames in a data set at the bag opening of the filter bag, updating the width and height of an initial prior frame by using the average value of the width and height of the real frames contained in each cluster;
s3.2.4: and repeating S3.2.2 and S3.2.3 until the widths and heights of the N initial prior frames are not changed or the maximum iteration number is reached, and taking the final N initial prior frame sizes as the prior frame sizes of the filter bag opening data set.
Further, when the target detection model is trained in S3, 4 images are randomly selected from the data set at the bag opening of the filter bag, and are first randomly scaled and randomly stitched and mixed by a Mosaic data enhancement method to obtain a new stitched image, and then the stitched image is used to train the target detection model.
Further, in the process of training the target detection model in S3, the learning rate adopts a cosine annealing attenuation strategy, thereby effectively preventing the model from falling into a local minimum.
Further, in the training process of the target detection model in S3, a label smoothing strategy is adopted as a way to avoid overfitting, and a calculation formula of label smoothing is as follows:
$$\hat{y}_K = y_K(1 - \varepsilon) + \frac{\varepsilon}{K}$$
where $\hat{y}_K$ is the smoothed label, $y_K$ is the original label, K is the number of classes, and $\varepsilon$ is the label smoothing value.
Further, in the training process of the target detection model in S3, a CIoU loss function is used:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
$$CIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v$$
$$L_{CIoU} = 1 - CIoU$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^2$$
where IoU represents the intersection-over-union ratio, A represents the prediction box, B represents the real box of the object, b represents the center point of A, $b^{gt}$ represents the center point of B, $\rho^2$ represents the square of the distance between the center of the prediction box and the center of the real box, and $c^2$ represents the square of the diagonal length of the minimum enclosing rectangle containing the prediction box and the real box; v is a parameter measuring the consistency of the aspect ratios, $\omega^{gt}$ is the width of the real box, $h^{gt}$ is the height of the real box, $\omega$ is the width of the prediction box, h is the height of the prediction box, and $\alpha$ is the weight coefficient,
$$\alpha = \frac{v}{(1 - IoU) + v}$$
the invention has the following beneficial effects:
(1) according to the invention, the characteristic extraction network of the YOLOv4 target detection model is replaced by the MobileNetV3 neural network which is more suitable for embedded equipment, so that the total parameter number of the model is reduced, and the calculation speed of the model is improved.
(2) The invention replaces the convolution in the bottleneck structure of the YOLOv4 target detection model with the depth separable convolution, further reducing the parameter quantity of the YOLOv4 model.
(3) According to the method, the K-means++ clustering algorithm is introduced into the selection of the initial prior boxes of the YOLOv4 target detection model, and the output of the K-means++ clustering algorithm is used as the initial prior box sizes of the YOLOv4 target detection model based on depth separable convolution.
Drawings
FIG. 1 is a flow diagram of the depth separable convolution YOLOv4 target detection model.
FIG. 2 is a block diagram of the depth separable convolution YOLOv4 target detection model.
FIG. 3 is a graph showing the results of the detection of the mouth of a filter bag, wherein the graph (a) shows a cross-pattern circular opening, (b) shows a circular protrusion opening, (c) shows a circular breakage opening, (d) shows a cross-line circular opening, (e) shows a circular depression opening, (f) shows a green cross-pattern circular opening, (g) shows a white oval opening, and (h) shows a purple oval opening.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and preferred embodiments, and the objects and effects of the present invention will become more apparent, it being understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
The principle of the invention is as follows: first, pictures of the filter bag opening are collected, data enhancement is performed on them, all resulting pictures are labeled, and the data are divided into a training set, a validation set and a test set. Then a YOLOv4 target detection model based on depth separable convolution is constructed: the output of the K-means++ clustering algorithm is adopted as the initial prior box sizes of the model, and the feature extraction network is replaced with the MobileNetV3 neural network, which has a smaller total number of parameters. Finally, the performance of the YOLOv4 target detection model based on depth separable convolution is verified on the test set, including the filter bag opening classification and localization results.
As one embodiment, the method for detecting the bag mouth position of the filter bag based on the depth separable convolution YOLOv4 model, as shown in FIG. 1, comprises the following steps:
s1: collecting pictures of a filter bag opening on a conveyor belt through a camera;
S2: performing data enhancement on all the pictures obtained in S1, specifically: sharpening the pictures of the filter bag opening, randomly rotating the images, changing their brightness, and mirror-flipping them to expand the image set, then labeling all images to obtain the filter bag opening data set; the data set is then divided into a training set, a validation set and a test set in a ratio of 8:1:1;
s3: constructing a YOLOv4 target detection model based on the depth separable convolution, setting training parameters, and training the YOLOv4 target detection model based on the depth separable convolution by adopting the filter bag opening data set constructed in the S2; s3 specifically includes the following substeps:
S3.1: constructing a YOLOv4 target detection model based on depth separable convolution, wherein the model comprises an input layer, a feature extraction network, a bottleneck structure and a prediction layer, as shown in FIG. 2; the feature extraction network adopts the MobileNetV3 neural network, and the depth separable convolution from MobileNetV3 is adopted as the convolution in the bottleneck structure of YOLOv4, further reducing the parameter quantity of the YOLOv4 model;
the MobilenetV3 neural network is used as the feature extraction network of YOLOv4, so that the calculation amount of the whole model can be reduced, and the running speed of the model can be obviously improved. MobilenetV3 combines the depth separable convolution in MobilenetV1 and the inverse residual structure in MobilenetV2, and the parameter pairs of the depth separable convolution and the normal convolution are shown in the formula:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
where $D_K$ is the size of the convolution kernel, $D_F$ is the size of the input and output feature maps, M is the number of channels of the input feature map, and N is the number of channels of the output feature map. The depth separable convolution effectively reduces the number of parameters, and the inverse residual structure reduces the loss of high-dimensional information. MobileNetV3 also introduces a lightweight attention mechanism to adjust the weight of each channel and replaces the swish activation function with the h-swish activation function, which reduces computation while improving model performance. The h-swish activation function is
$$h\text{-}swish(x) = x \cdot \frac{ReLU6(x + 3)}{6}$$
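For illustration, the following is a minimal PyTorch sketch of a depth separable convolution block using the h-swish activation described above. It is not the patent's implementation; the 3x3 kernel size, batch normalization placement, and example tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class HSwish(nn.Module):
    """h-swish(x) = x * ReLU6(x + 3) / 6, the MobileNetV3 activation."""
    def forward(self, x):
        return x * nn.functional.relu6(x + 3.0) / 6.0

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution.

    Versus an ordinary 3x3 convolution costing D_K*D_K*M*N*D_F*D_F multiply-adds,
    this costs D_K*D_K*M*D_F*D_F + M*N*D_F*D_F, i.e. roughly 1/N + 1/D_K^2 of it.
    """
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = HSwish()

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# e.g. a 416x416 RGB input yields a 32-channel map of the same spatial size
out = DepthwiseSeparableConv(3, 32)(torch.randn(1, 3, 416, 416))
```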
S3.2: when a YOLOv4 target detection model based on deep separable convolution detects the bag mouth of a filter bag, clustering is carried out on a prior frame by adopting a K-means + + clustering algorithm to obtain 9 prior frames with proper sizes, and the method specifically comprises the following steps:
s3.2.1: firstly, randomly selecting N real frames from a data set at the opening of a filter bag as initial prior frames, wherein N is equal to 9;
S3.2.2: calculating the distance between each real frame and each prior frame according to the formula $d_{mn} = 1 - IoU(box_m, anchor_n)$; when $d_{mn}$ is minimal, the corresponding m-th real frame belongs to the n-th cluster, where $box_m$ is the m-th real frame, $m \in M$, and M is the total number of real frames; $anchor_n$ is the n-th prior frame, $n \in N$; IoU is the ratio of the intersection of the two frames to their union, i.e.
$$IoU(box_m, anchor_n) = \frac{|box_m \cap anchor_n|}{|box_m \cup anchor_n|}$$
S3.2.3: after clustering all real frames in a data set at the bag opening of the filter bag, updating the width and height of an initial prior frame by using the average value of the width and height of the real frames contained in each cluster;
s3.2.4: s3.2.2 and S3.2.3 are repeated until the width and height of the N initial prior frames are not changed or the maximum iteration number is reached, and the final N initial prior frame sizes are used as the prior frame sizes of the filter bag opening data set, as shown in Table 1.
TABLE 1 Anchor frame size table (the clustered anchor box dimensions appear only as an image in the original document)
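The clustering procedure of S3.2.1 to S3.2.4 can be sketched in a few lines of Python. The helper below is an illustrative reconstruction rather than the authors' code; it assumes ground-truth boxes are given as (width, height) pairs and uses the random initial selection described in S3.2.1.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming boxes and anchors share a corner."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def cluster_anchors(boxes, n=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) boxes into n anchors with d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), n, replace=False)]      # S3.2.1
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, anchors), axis=1)   # S3.2.2
        new = np.array([boxes[assign == k].mean(axis=0) if np.any(assign == k)
                        else anchors[k] for k in range(n)])        # S3.2.3
        if np.allclose(new, anchors):  # S3.2.4: widths/heights unchanged
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```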
S3.3: inputting a data set of a bag opening of a filter bag into a YOLOv4 target detection model based on depth separable convolution, extracting picture characteristics by a characteristic extraction network, and obtaining characteristic graphs of different scales;
s3.4: inputting feature graphs of different sizes into a bottleneck structure, and repeatedly extracting features of the pictures to finally obtain three feature graphs of different sizes;
S3.5: predicting on the three feature maps of different sizes according to the initial prior box sizes obtained in S3.2, performing non-maximum suppression on the obtained prediction results, screening out the prediction box with the highest intersection-over-union with the real box, sending it into the prediction layer to adjust the prediction box, and updating the parameters of the target detection model through back propagation; a sketch of the suppression step follows below.
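The suppression step can be illustrated with a minimal NumPy sketch of greedy non-maximum suppression; the (x1, y1, x2, y2) box format and the 0.5 threshold are assumptions, not values taken from the patent.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS on (x1, y1, x2, y2) boxes; returns indices of kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the current best box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```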
S4: inputting the picture of the filter bag opening to be detected into a trained YOLOv4 target detection model based on depth separable convolution, and outputting the picture of the filter bag opening to be detected with the position of a detection frame and the type of the filter bag opening marked.
In order to effectively improve the robustness of the model, when the target detection model is trained in S3, 4 images are randomly selected from the filter bag opening data set, first randomly scaled and then randomly stitched and mixed by the Mosaic data enhancement method to obtain a new stitched image; the stitched image is then used to train the target detection model, as sketched below.
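A rough sketch of the Mosaic stitching idea follows. It is a simplified, assumption-laden illustration (nearest-neighbor resizing, no remapping of label boxes), not the actual augmentation pipeline.

```python
import numpy as np

def mosaic(images, out_size=416, seed=None):
    """Stitch 4 images (H, W, 3 uint8 arrays) around a random split point."""
    rng = np.random.default_rng(seed)
    cx = int(out_size * rng.uniform(0.3, 0.7))  # random split point
    cy = int(out_size * rng.uniform(0.3, 0.7))
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # crude random scaling: each image is resized to fill its quadrant
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
        # a full implementation would also shift and clip the label boxes
    return canvas
```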
In order to improve the performance of the model, in the training process of the target detection model in S3:
(1) the learning rate adopts a cosine annealing decay strategy, which effectively prevents the model from falling into a local minimum; a cosine function is used in both the rising period and the falling period, as in the sketch below;
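A possible shape for such a schedule, with illustrative hyperparameter values only, is:

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5, warmup=500):
    """Cosine-annealed learning rate with a cosine-shaped warm-up phase."""
    if step < warmup:  # rising period
        return lr_max * 0.5 * (1.0 - math.cos(math.pi * step / warmup))
    t = (step - warmup) / max(1, total_steps - warmup)  # falling period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```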
(2) the overfitting can be effectively avoided by adopting a label smoothing strategy, and the calculation formula of the label smoothing is as follows:
$$\hat{y}_K = y_K(1 - \varepsilon) + \frac{\varepsilon}{K}$$
where $\hat{y}_K$ is the smoothed label, $y_K$ is the original label, K is the number of classes, and $\varepsilon$ is the label smoothing value.
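A small numerical sketch of this formula (the value of ε here is illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.05):
    """y_hat = y * (1 - eps) + eps / K, applied to one-hot label rows."""
    k = y_onehot.shape[-1]  # K, the number of classes
    return y_onehot * (1.0 - eps) + eps / k

# a 3-class one-hot label [0, 1, 0] becomes about [0.017, 0.967, 0.017]
print(smooth_labels(np.eye(3)[1], eps=0.05))
```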
(3) The CIoU loss function is used instead of the original loss function:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
$$CIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v$$
$$L_{CIoU} = 1 - CIoU$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^2$$
where IoU represents the intersection-over-union ratio and CIoU stands for Complete Intersection over Union; A represents the prediction box, B represents the real box of the object, b represents the center point of A, $b^{gt}$ represents the center point of B, $\rho^2$ represents the square of the distance between the center of the prediction box and the center of the real box, and $c^2$ represents the square of the diagonal length of the minimum enclosing rectangle containing the prediction box and the real box; v is a parameter measuring the consistency of the aspect ratios, $\omega^{gt}$ is the width of the real box, $h^{gt}$ is the height of the real box, $\omega$ is the width of the prediction box, h is the height of the prediction box, and $\alpha$ is the weight coefficient,
$$\alpha = \frac{v}{(1 - IoU) + v}$$
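For reference, the following single-pair Python sketch evaluates the CIoU loss from the formulas above; the (x1, y1, x2, y2) box format and the small epsilon guards are assumptions added for numerical safety.

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2); returns 1 - CIoU."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # squared center distance rho^2 and squared enclosing-box diagonal c^2
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2 +
            (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4.0
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term v and its weight alpha
    v = (4.0 / math.pi ** 2) * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1] + eps)) -
        math.atan((pred[2] - pred[0]) / (pred[3] - pred[1] + eps))) ** 2
    alpha = v / ((1.0 - iou) + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)
```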
FIG. 3 shows the detection results of the method of the present invention on filter bag openings; as the figure shows, the method accurately detects the positions of various types of filter bag openings. Moreover, the original YOLOv4 model has 244M parameters, while the proposed model has only 53.8M.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art may still modify the described embodiments or substitute equivalents for some of their features. All modifications and equivalents that come within the spirit and principle of the invention are intended to be included within its scope.

Claims (8)

1. A filter bag opening position detection method based on a depth separable convolution YOLOv4 model is characterized by comprising the following steps:
s1: collecting pictures of a filter bag opening on a conveyor belt through a camera;
s2: performing data enhancement on all the pictures obtained in the step S1, labeling the pictures, and constructing a filter bag opening data set;
s3: constructing a YOLOv4 target detection model based on the depth separable convolution, setting training parameters, and training the YOLOv4 target detection model based on the depth separable convolution by adopting the filter bag opening data set constructed in the S2;
s4: inputting the picture of the filter bag opening to be detected into a trained YOLOv4 target detection model based on depth separable convolution, and outputting the picture of the filter bag opening to be detected with the position of a detection frame and the type of the filter bag opening marked.
2. The method for detecting the filter bag opening position based on the depth separable convolution YOLOv4 model as claimed in claim 1, wherein the data enhancement in S2 is specifically: sharpening the pictures of the filter bag opening, then randomly rotating the images, changing their brightness, and mirror-flipping them to expand the image set, and then labeling all images to obtain the filter bag opening data set.
3. The method for detecting the bag mouth position of the filter bag based on the deep separable convolution YOLOv4 model as claimed in claim 1, wherein the S3 specifically comprises the following sub-steps:
S3.1: constructing a YOLOv4 target detection model based on depth separable convolution, wherein the model comprises an input layer, a feature extraction network, a bottleneck structure and a prediction layer; the feature extraction network adopts the MobileNetV3 neural network, and the depth separable convolution from MobileNetV3 is adopted as the convolution in the bottleneck structure of YOLOv4;
S3.2: before detecting filter bag openings, the YOLOv4 target detection model based on depth separable convolution clusters the prior boxes with the K-means++ clustering algorithm to obtain 9 prior boxes of suitable sizes;
s3.3: inputting a data set of a bag opening of a filter bag into a YOLOv4 target detection model based on depth separable convolution, extracting picture characteristics by a characteristic extraction network, and obtaining characteristic graphs of different scales;
s3.4: inputting feature graphs of different sizes into a bottleneck structure, and repeatedly extracting features of the pictures to finally obtain three feature graphs of different sizes;
S3.5: predicting on the three feature maps of different sizes according to the initial prior box sizes obtained in S3.2, performing non-maximum suppression on the obtained prediction results, screening out the prediction box with the highest intersection-over-union with the real box, sending it into the prediction layer to adjust the prediction box, and updating the parameters of the target detection model through back propagation.
4. The method for detecting the bag mouth position of the filter bag based on the deep separable convolution YOLOv4 model as claimed in claim 3, wherein the S3.2 is implemented by the following steps:
s3.2.1: firstly, randomly selecting N real frames from a data set at the opening of a filter bag as initial prior frames, wherein N is equal to 9;
S3.2.2: calculating the distance between each real frame and each prior frame according to the formula $d_{mn} = 1 - IoU(box_m, anchor_n)$; when $d_{mn}$ is minimal, the corresponding m-th real frame belongs to the n-th cluster, wherein $box_m$ is the m-th real frame, $m \in M$, and M is the total number of real frames; $anchor_n$ is the n-th prior frame, $n \in N$; IoU is the ratio of the intersection of the two frames to their union, i.e.
$$IoU(box_m, anchor_n) = \frac{|box_m \cap anchor_n|}{|box_m \cup anchor_n|}$$
S3.2.3: after clustering all real frames in a data set at the bag opening of the filter bag, updating the width and height of an initial prior frame by using the average value of the width and height of the real frames contained in each cluster;
s3.2.4: and repeating S3.2.2 and S3.2.3 until the widths and heights of the N initial prior frames are not changed or the maximum iteration number is reached, and taking the final N initial prior frame sizes as the prior frame sizes of the filter bag opening data set.
5. The method for detecting the position of the bag mouth of the filter bag based on the depth separable convolution YOLOv4 model as claimed in claim 3, wherein when the target detection model is trained in S3, 4 images are randomly selected from the bag mouth data set, and are randomly scaled and randomly spliced and mixed by a Mosaic data enhancement method to obtain a new spliced image, and then the target detection model is trained by using the spliced image.
6. The method for detecting the bag mouth position of the filter bag based on the deep separable convolution YOLOv4 model as claimed in claim 3, wherein in the training process of the target detection model in S3, the learning rate adopts a cosine annealing attenuation strategy, so that the model is effectively prevented from falling into a local minimum.
7. The method for detecting the bag mouth position of the filter bag based on the deep separable convolution YOLOv4 model as claimed in claim 3, wherein in the training process of the target detection model in S3, a label smoothing strategy is adopted as a way to avoid overfitting, and the calculation formula of label smoothing is as follows:
$$\hat{y}_K = y_K(1 - \varepsilon) + \frac{\varepsilon}{K}$$
wherein $\hat{y}_K$ is the smoothed label, $y_K$ is the original label, K is the number of classes, and $\varepsilon$ is the label smoothing value.
8. The method for detecting the bag mouth position of the filter bag based on the deep separable convolution YOLOv4 model as claimed in claim 3, wherein in the training process of the target detection model in S3, a CIoU loss function is used:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
$$CIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v$$
$$L_{CIoU} = 1 - CIoU$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^2$$
wherein IoU represents the intersection-over-union ratio, A represents the prediction box, B represents the real box of the object, b represents the center point of A, $b^{gt}$ represents the center point of B, $\rho^2$ represents the square of the distance between the center of the prediction box and the center of the real box, and $c^2$ represents the square of the diagonal length of the minimum enclosing rectangle containing the prediction box and the real box; v is a parameter measuring the consistency of the aspect ratios, $\omega^{gt}$ is the width of the real box, $h^{gt}$ is the height of the real box, $\omega$ is the width of the prediction box, h is the height of the prediction box, and $\alpha$ is the weight coefficient,
$$\alpha = \frac{v}{(1 - IoU) + v}$$
CN202111152052.7A 2021-09-29 2021-09-29 Depth separable convolution YOLOv4 model-based filter bag opening position detection method Pending CN113887395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111152052.7A CN113887395A (en) 2021-09-29 2021-09-29 Depth separable convolution YOLOv4 model-based filter bag opening position detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111152052.7A CN113887395A (en) 2021-09-29 2021-09-29 Depth separable convolution YOLOv4 model-based filter bag opening position detection method

Publications (1)

Publication Number Publication Date
CN113887395A 2022-01-04

Family

ID=79008107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111152052.7A Pending CN113887395A (en) 2021-09-29 2021-09-29 Depth separable convolution YOLOv4 model-based filter bag opening position detection method

Country Status (1)

Country Link
CN (1) CN113887395A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019226A (en) * 2022-05-13 2022-09-06 云南农业大学 Tea leaf picking and identifying method based on improved YoloV4 model


Similar Documents

Publication Title
CN110598752B (en) Image classification model training method and system for automatically generating training data set
CN110473173A (en) A kind of defect inspection method based on deep learning semantic segmentation
CN109886335B (en) Classification model training method and device
CN106599789A (en) Video class identification method and device, data processing device and electronic device
CN111178120B (en) Pest image detection method based on crop identification cascading technology
CN110163033A (en) Positive sample acquisition methods, pedestrian detection model generating method and pedestrian detection method
CN110490099B (en) Subway public place pedestrian flow analysis method based on machine vision
US20150310305A1 (en) Learning painting styles for painterly rendering
CN111242127A (en) Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution
CN109242829A (en) Liquid crystal display defect inspection method, system and device based on small sample deep learning
CN111126404A (en) Ancient character and font identification method based on improved YOLO v3
CN103353881B (en) Method and device for searching application
CN115880271B (en) Identification and detection method for edge angle of seed crystal single crystal line in crystal growth process
CN108710893A (en) A kind of digital image cameras source model sorting technique of feature based fusion
CN109614866A (en) Method for detecting human face based on cascade deep convolutional neural networks
US11807551B2 (en) Systems and methods for treating wastewater
CN111488920A (en) Bag opening position detection method based on deep learning target detection and recognition
CN111241933A (en) Pig farm target identification method based on universal countermeasure disturbance
CN115661628A (en) Fish detection method based on improved YOLOv5S model
CN112651989A (en) SEM image molecular sieve particle size statistical method and system based on Mask RCNN example segmentation
CN113887395A (en) Depth separable convolution YOLOv4 model-based filter bag opening position detection method
CN109376619B (en) Cell detection method
CN113469984B (en) Method for detecting appearance of display panel based on YOLO structure
CN113591761B (en) Video shot language identification method
CN115546788A (en) Concrete bubble detection method based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination