CN113591901A - Target detection method based on anchor frame - Google Patents

Target detection method based on anchor frame

Info

Publication number
CN113591901A
CN113591901A (application CN202110649575.6A)
Authority
CN
China
Prior art keywords
anchor
frame
detection method
target detection
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110649575.6A
Other languages
Chinese (zh)
Inventor
赵丹萍
唐伟
杨泽文
江倩
杜卉
高世旺
曹宝龙
汤兆鑫
徐瑞东
田媛
王振楠
马永峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINA AEROSPACE TIMES ELECTRONICS CO LTD
Original Assignee
CHINA AEROSPACE TIMES ELECTRONICS CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA AEROSPACE TIMES ELECTRONICS CO LTD filed Critical CHINA AEROSPACE TIMES ELECTRONICS CO LTD
Priority to CN202110649575.6A priority Critical patent/CN113591901A/en
Publication of CN113591901A publication Critical patent/CN113591901A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an anchor-frame-based target detection method that detects targets on an SSD (Single Shot MultiBox Detector) framework. Adopting a MobileNet network architecture in the feature extraction part effectively increases detection speed while keeping accuracy high, so the method balances the speed and accuracy of position detection well and is robust. Extracting feature maps from shallow convolution layers improves the detection accuracy of small targets, and the method has wide market application potential.

Description

Target detection method based on anchor frame
Technical Field
The invention belongs to the field of image target detection, and relates to a target detection method based on an anchor frame.
Background
With the continuous development of productivity and the acceleration of urbanization, labor shortages have become increasingly prominent in agricultural planting, and agricultural robots are urgently needed to fully or partially replace people in completing complex tasks efficiently, safely, and reliably. In the flower planting industry, for example, picking consumes a great deal of manpower, the current number of pickers falls far short of demand, and picking efficiency is low. A picking robot that replaces or assists people is therefore urgently needed to reduce labor costs. For a flower picking robot, acquiring visual information about the planted flowers is an essential prerequisite for its operation, so accurate localization and classification of flower targets is of great importance.
In recent years, with the rapid development of artificial intelligence, deep-learning-based target detection has gradually become the mainstream approach in image target detection. Current deep learning methods in this field fall into two main types: two-stage and one-stage target detection algorithms. A two-stage algorithm first generates a series of candidate frames as samples and then classifies and regresses them with a convolutional neural network; common examples include RCNN and Fast RCNN. A one-stage algorithm skips candidate-frame generation and directly casts target-frame localization as a regression problem; common examples include YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector).
The two-stage and one-stage methods trade off detection precision against detection speed: in general, two-stage methods are more accurate but slower than one-stage methods.
The form and size of samples in agricultural planting are complex, targets of different sizes must be detected, and the picking stage of the flower planting industry must identify flowers at three different growth stages. Existing target detection methods have difficulty simultaneously meeting the precision, accuracy, and speed requirements of the agricultural planting field.
Disclosure of Invention
The invention aims to overcome the defects and provides a target detection method based on an anchor frame.
In order to achieve the above purpose, the invention provides the following technical scheme:
an anchor frame-based target detection method comprises the following steps:
s1, collecting a target image Data set to be detected, and labeling to obtain a Data set Data 0;
s2, amplifying Data0 to obtain a Data set Data 1;
s3, dividing Data1 into a training set and a test set, wherein the number of samples contained in the training set is larger than that of the samples contained in the test set;
s4, constructing a target detection neural network model, and training by using a training set to obtain a training model;
s5, the image position is predicted by the training model, and the classification result is output.
Further, the number of samples in data set Data0 in step S1 is at least 200; a single sample includes target images of multiple categories.
Further, the labeling in step S1 is manual: the target area is marked with a rectangular frame, and the four vertex coordinates of the rectangle are arranged clockwise starting from the upper-left corner of the image.
Further, the target detection neural network model in step S4 comprises a feature extraction module, a target prediction module, and a non-maximum suppression module;
the feature extraction module is used for extracting feature maps from the input image stage by stage;
the target prediction module is used for presetting anchor frames on each feature map and predicting positions and categories after feature fusion;
the non-maximum suppression module is used for removing redundant or invalid anchor frames from the image to be detected, retaining the anchor frame with the highest target-class probability, and outputting its position.
Further, the anchor frames are a plurality of bounding boxes of different sizes and aspect ratios generated centered on each pixel; the sizes and aspect ratios are set according to a statistical analysis of the data in the data set.
Furthermore, the target detection neural network model adopts the SSD target detection model architecture, with DIOU_loss as the loss function and the Mish function as the activation function.
Further, the feature extraction part adopts a MobileNet network architecture and the normalization part adopts the CmBN algorithm; the non-maximum suppression module adopts the DIoU-NMS algorithm.
Further, the target prediction module selects the feature maps after convolution layers 3 to 13 in the MobileNet architecture for position prediction.
Further, in step S3, the ratio of the training set Data11 to the test set Data12 is 9:1, 8:2, or 7:3.
Further, in step S2, the augmentation methods are Mosaic data enhancement and self-adversarial training. Mosaic data enhancement stitches four pictures together: on each pass, four pictures are read, each is randomly flipped, scaled, or color-shifted, and the four are placed into the four quadrants so that the pictures, and their anchor frames, are combined. Self-adversarial training has two phases: in the first, the neural network modifies the original image; in the second, the network is trained to perform the target detection task on the modified image in the normal way.
Compared with the prior art, the invention has the following beneficial effects:
(1) the method detects targets on the SSD architecture; adopting a MobileNet network structure in the feature extraction part effectively increases detection speed while keeping accuracy high, balancing the speed and accuracy of position detection well, with strong robustness.
(2) The invention combines the DIOU_loss loss function, the Mish activation function, CmBN normalization, and the DIoU-NMS algorithm in the non-maximum suppression module, improving both the accuracy and the speed of target detection;
(3) the method extracts feature maps from shallow convolution layers, improving the detection accuracy of small targets.
(4) The invention has high applicability and wide market application potential.
Drawings
FIG. 1 is a flowchart of an anchor frame-based target detection method according to the present invention.
Detailed Description
The features and advantages of the present invention will become more apparent and appreciated from the following detailed description of the invention.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. While aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
An anchor-frame-based target detection method comprises the following steps:
S1, collecting a target image data set to be detected and labeling it to obtain data set Data0;
S2, augmenting Data0 to obtain data set Data1;
S3, dividing Data1 into a training set and a test set, where the training set contains more samples than the test set;
S4, constructing a target detection neural network model and training it with the training set to obtain a trained model;
S5, predicting image positions with the trained model and outputting the classification results.
Further, the number of samples in data set Data0 in step S1 is at least 200; a single sample includes target images of multiple categories.
Further, the labeling in step S1 is manual: LabelImg software is used to label the position and class of each target frame, the target area is marked with a rectangular frame, and the four vertex coordinates of the rectangle are arranged clockwise starting from the upper-left corner of the image.
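As an illustration of the labeling convention described above, a minimal sketch of the clockwise vertex ordering; it assumes LabelImg-style corner coordinates (x_min, y_min, x_max, y_max) in image coordinates where y grows downward, which is an assumption rather than something the patent specifies:

```python
def rect_vertices_clockwise(x_min, y_min, x_max, y_max):
    """Return the four vertices of an axis-aligned rectangle, ordered
    clockwise starting from the top-left corner (image coordinates,
    y increases downward)."""
    return [
        (x_min, y_min),  # top-left
        (x_max, y_min),  # top-right
        (x_max, y_max),  # bottom-right
        (x_min, y_max),  # bottom-left
    ]
```

For a box with top-left (10, 20) and bottom-right (50, 60), this yields the four corners in clockwise order starting from (10, 20).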
Further, the target detection neural network model in step S4 comprises a feature extraction module, a target prediction module, and a non-maximum suppression module;
the feature extraction module is used for extracting feature maps from the input image stage by stage;
the target prediction module is used for presetting anchor frames on each feature map and predicting positions and categories after feature fusion;
the non-maximum suppression module is used for removing redundant or invalid anchor frames from the image to be detected, retaining the anchor frame with the highest target-class probability, and outputting its position.
Further, the anchor frames are a plurality of bounding boxes of different sizes and aspect ratios generated centered on each pixel; the sizes and aspect ratios are set according to a statistical analysis of the data in the data set.
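The anchor generation described above can be sketched as follows. The stride, sizes, and ratios passed in are illustrative placeholders, since the patent leaves the statistically chosen values unspecified:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Generate (cx, cy, w, h) anchor boxes centered on every cell of a
    feature map. `sizes` are anchor side lengths in input-image pixels and
    `ratios` are width/height aspect ratios; in the patented method both
    would come from a statistical analysis of the data set."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # center of this feature-map cell, in input-image coordinates
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    # keep the anchor area roughly s*s while varying shape
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)
```

For a 2x2 feature map with stride 16, one size, and two ratios, this produces 8 anchors, two per cell.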
Furthermore, the target detection neural network model adopts the SSD target detection model architecture, with DIOU_loss as the loss function and the Mish function as the activation function.
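A minimal, framework-free sketch of the two functions named here: a scalar Mish activation and a single-pair DIoU loss (1 - IoU plus the squared center distance normalized by the squared diagonal of the smallest enclosing box). A real training loop would use batched tensor versions, but the arithmetic is the same:

```python
import math

def mish(x):
    """Mish activation: x * tanh(softplus(x)). Numerically naive for
    large |x|; fine for illustration."""
    return x * math.tanh(math.log1p(math.exp(x)))

def diou_loss(box1, box2):
    """DIoU loss for two (x1, y1, x2, y2) boxes:
    1 - IoU + d^2 / c^2, where d is the distance between box centers
    and c is the diagonal of the smallest enclosing box."""
    # intersection and union for IoU
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    iou = inter / (area1 + area2 - inter)
    # squared distance between box centers
    d2 = ((box1[0] + box1[2]) / 2 - (box2[0] + box2[2]) / 2) ** 2 \
       + ((box1[1] + box1[3]) / 2 - (box2[1] + box2[3]) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (max(box1[2], box2[2]) - min(box1[0], box2[0])) ** 2 \
       + (max(box1[3], box2[3]) - min(box1[1], box2[1])) ** 2
    return 1.0 - iou + d2 / c2
```

Unlike plain IoU loss, the distance term still produces a gradient signal when the predicted and ground-truth boxes do not overlap at all.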
Further, the feature extraction part adopts a MobileNet network architecture and the normalization part adopts the CmBN algorithm; the non-maximum suppression module adopts the DIoU-NMS algorithm.
The feature extraction part uses a MobileNet network to keep the model lightweight, and the non-maximum suppression module uses the DIoU-NMS algorithm, which can improve model recall and the detection results. In the MobileNet architecture, the feature maps after convolution layers 3 to 13 are selected for prediction, and anchor frames of different sizes are preset on each feature map.
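The DIoU-NMS step can be sketched as follows: it runs like ordinary greedy NMS, except the suppression criterion subtracts the normalized center distance from the IoU, so overlapping boxes with well-separated centers are more likely to both survive. The 0.5 threshold is an illustrative assumption, not a value from the patent:

```python
def diou_nms(boxes, scores, threshold=0.5):
    """Greedy DIoU-NMS sketch. boxes: list of (x1, y1, x2, y2);
    scores: matching confidence list. Returns indices of kept boxes."""
    def diou(a, b):
        # standard IoU
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        iou = inter / union
        # penalty: squared center distance over squared enclosing diagonal
        d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
           + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
        c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
           + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
        return iou - d2 / c2

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)          # highest-scoring remaining box
        keep.append(i)
        # suppress boxes whose DIoU with the kept box exceeds the threshold
        order = [j for j in order if diou(boxes[i], boxes[j]) <= threshold]
    return keep
```

In the example below, the second box heavily overlaps the first and is suppressed, while the distant third box survives.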
Further, the target prediction module selects the feature maps after convolution layers 3 to 13 in the MobileNet architecture for position prediction.
Further, in step S3, the ratio of the training set Data11 to the test set Data12 is 9:1, 8:2, or 7:3.
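A simple sketch of the split described above, defaulting to the 9:1 ratio (8:2 and 7:3 work by changing `train_ratio`); the shuffle-then-slice strategy and the fixed seed are implementation assumptions:

```python
import random

def split_dataset(samples, train_ratio=0.9, seed=0):
    """Shuffle a sample list and split it so the training set holds
    `train_ratio` of the data (9:1 by default)."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]
```

With the 200-sample data set mentioned in step S1, a 9:1 split yields 180 training and 20 test samples.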
Further, in step S2, the augmentation methods are Mosaic data enhancement and self-adversarial training. Mosaic data enhancement stitches four pictures together: on each pass, four pictures are read, each is randomly flipped, scaled, or color-shifted, and the four are placed into the four quadrants so that the pictures, and their anchor frames, are combined. Self-adversarial training has two phases: in the first, the neural network modifies the original image; in the second, the network is trained to perform the target detection task on the modified image in the normal way.
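A much-simplified Mosaic sketch: it only tiles four resized images into the quadrants of one canvas around a random split point. The per-tile flips, scaling, color-gamut changes, and anchor-frame remapping described above are deliberately omitted, and the 512-pixel output size is an assumption borrowed from the embodiment:

```python
import numpy as np

def mosaic_combine(imgs, out_size=512, seed=0):
    """Tile four HxWx3 uint8 images into the four quadrants of one
    out_size x out_size canvas around a random split point (cx, cy)."""
    rng = np.random.default_rng(seed)
    # random split point, kept away from the borders
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    quads = [(0, cy, 0, cx), (0, cy, cx, out_size),
             (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(imgs, quads):
        h, w = y1 - y0, x1 - x0
        # nearest-neighbour resize via index sampling (no cv2 dependency)
        ys = np.arange(h) * img.shape[0] // h
        xs = np.arange(w) * img.shape[1] // w
        canvas[y0:y1, x0:x1] = img[ys][:, xs]
    return canvas
```

A full implementation would also transform each tile's bounding boxes into canvas coordinates and clip them at the quadrant edges.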
Example 1
Fig. 1 shows a target detection method for a flower picking robot; the specific steps are as follows.
S1: the data set used in this embodiment consists of 200 rose images collected manually in-house for training the target detection model, where each image contains rose targets of multiple categories. The images were shot omnidirectionally from multiple angles, and three growth-stage categories were collected: flower buds, unopened blossoms, and large open flowers.
The acquired data set is labeled manually to obtain data set Data0. During labeling, each flower area is marked with a rectangular frame, i.e., the four vertex coordinates of the target are recorded, arranged clockwise starting from the upper-left corner of the image.
S2: the labeled data set is augmented by applying Mosaic data enhancement and self-adversarial training to Data0; in the Mosaic method, four pictures are randomly flipped, scaled, or color-shifted and then combined, and the output is the augmented data set Data1 at a fixed augmentation factor.
S3: the augmented data set Data1 is divided into a training set and a test set at a ratio of 9:1.
S4: a target detection neural network model is constructed and trained with the labeled training set to obtain the optimal trained model. The neural network comprises three modules: a feature extraction module, a target prediction module, and a non-maximum suppression module. The model adopts the SSD (Single Shot MultiBox Detector) architecture; in MobileNet, the feature maps after convolution layers 3 to 13 are selected for prediction, and several anchor frames of different sizes are preset on each feature map. The loss function during training is DIOU_loss.
S5: based on the trained neural network model, flower positions are predicted in flower images.
As shown in Table 1, the invention adopts the MobileNet architecture within the SSD algorithm for image feature extraction. A single qualifying image is input to the MobileNet network; in this embodiment, the training-set images are first resized to a width, height, and channel count of 512 × 512 × 3. Exploiting MobileNet's depthwise separable convolution, the original standard 3 × 3 convolution is replaced by two new convolutions: a depthwise convolution, in which a 3 × 3 kernel convolves each input channel separately so that each channel outputs its own feature map, and a pointwise convolution, in which 1 × 1 kernels fuse those per-channel feature maps into the final output. The numbers of kernels in the depthwise convolution layers are 32, 64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 512, 1024; the numbers of kernels in the pointwise convolutions are 32, 64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 256, 1024. The output of the 14 depthwise separable convolution layers is fed to an average pooling layer, whose result is fed to a fully connected layer, producing a 1 × 1000 feature vector; a classifier then classifies the extracted flower features.
Table 1 MobileNet network configuration diagram
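The depthwise-separable factorization described above (and summarized in Table 1) can be sketched numerically. This naive NumPy version uses stride 1 and no padding purely to show how the per-channel depthwise step and the 1 × 1 pointwise fusion compose; it is an illustration, not the patent's implementation:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Depthwise separable convolution sketch (stride 1, no padding).
    x: (H, W, C_in) input; dw_kernels: (k, k, C_in), one kxk filter per
    input channel; pw_kernels: (C_in, C_out), the 1x1 pointwise filters
    that fuse the per-channel maps into the final output."""
    H, W, C_in = x.shape
    k = dw_kernels.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    # depthwise step: each channel is convolved with its own kernel
    dw = np.zeros((Ho, Wo, C_in))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i:i + k, j:j + k, :]             # (k, k, C_in)
            dw[i, j] = (patch * dw_kernels).sum(axis=(0, 1))
    # pointwise step: a 1x1 convolution is a matrix product over channels
    return dw @ pw_kernels                             # (Ho, Wo, C_out)
```

The factorization replaces the k*k*C_in*C_out multiplications of a standard convolution per output position with k*k*C_in + C_in*C_out, which is the source of MobileNet's speedup.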
The method has the advantages of high detection speed and high precision. Because a flower picking robot must pick in real time, it places high demands on image target detection speed; adopting a MobileNet network architecture in the feature extraction part of the constructed target detection model greatly increases detection speed. The invention also meets the application requirements for detection precision and has strong robustness.
The method can effectively detect the flower images and has wide market application potential.
The invention has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to be construed in a limiting sense. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.
Those skilled in the art will appreciate that those matters not described in detail in the present specification are well known in the art.

Claims (10)

1. An anchor-frame-based target detection method, characterized by comprising the following steps:
S1, collecting a target image data set to be detected and labeling it to obtain data set Data0;
S2, augmenting Data0 to obtain data set Data1;
S3, dividing Data1 into a training set and a test set, where the training set contains more samples than the test set;
S4, constructing a target detection neural network model and training it with the training set to obtain a trained model;
S5, predicting image positions with the trained model.
2. The anchor-frame-based target detection method as claimed in claim 1, wherein the number of samples in data set Data0 in step S1 is at least 200, and a single sample includes target images of multiple categories.
3. The anchor-frame-based target detection method according to claim 1, wherein the labeling in step S1 is manual: the target area is marked with a rectangular frame, and the four vertex coordinates of the rectangle are arranged clockwise starting from the upper-left corner of the image.
4. The anchor-frame-based target detection method of claim 1, wherein the augmentation in step S2 uses Mosaic data enhancement and self-adversarial training; in Mosaic data enhancement, four pictures are randomly flipped, scaled, or color-shifted, and then the pictures and their anchor frames are combined.
5. The anchor-frame-based target detection method of claim 1, wherein the target detection neural network model in step S4 comprises a feature extraction module, a target prediction module, and a non-maximum suppression module;
the feature extraction module is used for extracting feature maps from the input image stage by stage;
the target prediction module is used for presetting anchor frames on each feature map and predicting positions and categories after feature fusion;
the non-maximum suppression module is used for removing redundant or invalid anchor frames from the image to be detected, retaining the anchor frame with the highest target-class probability, and outputting its position.
6. The anchor-frame-based target detection method according to claim 5, wherein the anchor frames are a plurality of bounding boxes of different sizes and aspect ratios generated centered on each pixel, the sizes and aspect ratios being set according to a statistical analysis of the data in the data set.
7. The anchor-frame-based target detection method of claim 1, wherein the target detection neural network model adopts the SSD target detection model architecture, the model loss function is DIOU_loss, and the activation function in the convolution layers is the Mish function.
8. The anchor-frame-based target detection method according to claim 5, wherein the feature extraction part adopts a MobileNet network architecture and the normalization part adopts the CmBN algorithm; the non-maximum suppression module adopts the DIoU-NMS algorithm.
9. The anchor-frame-based target detection method according to claim 5, wherein the target prediction module is configured to select the feature maps after convolution layers 3 to 13 in the MobileNet architecture for position prediction.
10. The anchor-frame-based target detection method as claimed in claim 1, wherein in step S3, the ratio of the training set Data11 to the test set Data12 is 9:1, 8:2, or 7:3.
CN202110649575.6A 2021-06-10 2021-06-10 Target detection method based on anchor frame Pending CN113591901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649575.6A CN113591901A (en) 2021-06-10 2021-06-10 Target detection method based on anchor frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110649575.6A CN113591901A (en) 2021-06-10 2021-06-10 Target detection method based on anchor frame

Publications (1)

Publication Number Publication Date
CN113591901A true CN113591901A (en) 2021-11-02

Family

ID=78243673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110649575.6A Pending CN113591901A (en) 2021-06-10 2021-06-10 Target detection method based on anchor frame

Country Status (1)

Country Link
CN (1) CN113591901A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263819A (en) * 2019-05-28 2019-09-20 中国农业大学 A kind of object detection method and device for shellfish image
CN110378420A (en) * 2019-07-19 2019-10-25 Oppo广东移动通信有限公司 A kind of image detecting method, device and computer readable storage medium
CN111178182A (en) * 2019-12-16 2020-05-19 深圳奥腾光通系统有限公司 Real-time detection method for garbage loss behavior
CN112215861A (en) * 2020-09-27 2021-01-12 深圳市优必选科技股份有限公司 Football detection method and device, computer readable storage medium and robot
CN112270381A (en) * 2020-11-16 2021-01-26 电子科技大学 People flow detection method based on deep learning
CN112836623A (en) * 2021-01-29 2021-05-25 北京农业智能装备技术研究中心 Facility tomato farming decision auxiliary method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhaohui Zheng et al.: "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression", arXiv:1911.08287v1 [cs.CV], pages 142-157 *

Similar Documents

Publication Publication Date Title
Fu et al. Fast and accurate detection of kiwifruit in orchard using improved YOLOv3-tiny model
Sadeghi-Tehran et al. DeepCount: in-field automatic quantification of wheat spikes using simple linear iterative clustering and deep convolutional neural networks
Rustia et al. Automatic greenhouse insect pest detection and recognition based on a cascaded deep learning classification method
Chen et al. Detecting citrus in orchard environment by using improved YOLOv4
Wang et al. YOLOv3‐Litchi Detection Method of Densely Distributed Litchi in Large Vision Scenes
CN106372648A (en) Multi-feature-fusion-convolutional-neural-network-based plankton image classification method
CN110770752A (en) Automatic pest counting method combining multi-scale feature fusion network with positioning model
CN111340141A (en) Crop seedling and weed detection method and system based on deep learning
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN112215795B (en) Intelligent detection method for server component based on deep learning
CN112529090B (en) Small target detection method based on improved YOLOv3
Wen et al. Wheat spike detection and counting in the field based on SpikeRetinaNet
CN113487576B (en) Insect pest image detection method based on channel attention mechanism
Wang et al. Field rice panicle detection and counting based on deep learning
CN111178177A (en) Cucumber disease identification method based on convolutional neural network
CN110059539A (en) A kind of natural scene text position detection method based on image segmentation
US10304194B2 (en) Method for quantifying produce shape
CN114821102A (en) Intensive citrus quantity detection method, equipment, storage medium and device
CN114781514A (en) Floater target detection method and system integrating attention mechanism
CN109242826B (en) Mobile equipment end stick-shaped object root counting method and system based on target detection
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN114548208A (en) Improved plant seed real-time classification detection method based on YOLOv5
CN116385374A (en) Cell counting method based on convolutional neural network
Lu et al. Citrus green fruit detection via improved feature network extraction
CN109615610B (en) Medical band-aid flaw detection method based on YOLO v2-tiny

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination