CN112232263B - Tomato identification method based on deep learning - Google Patents

Tomato identification method based on deep learning Download PDF

Info

Publication number
CN112232263B
CN112232263B (application CN202011169184.6A)
Authority
CN
China
Prior art keywords
layers
tomato
convolution
training
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011169184.6A
Other languages
Chinese (zh)
Other versions
CN112232263A (en
Inventor
梁喜凤
顾鹏程
赵力勤
余文胜
孙立峰
徐学珍
谢文兵
王永维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Jiliang University
Original Assignee
China Jiliang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Jiliang University filed Critical China Jiliang University
Priority to CN202011169184.6A priority Critical patent/CN112232263B/en
Publication of CN112232263A publication Critical patent/CN112232263A/en
Application granted granted Critical
Publication of CN112232263B publication Critical patent/CN112232263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]


Abstract

The invention discloses a tomato identification method based on deep learning. Images of tomatoes are first collected under natural conditions and augmented to enlarge the data set; the target tomatoes in all images are then manually annotated, and the data are divided into a training set and a validation set. In the detection network, all pooling layers are removed from the VGG backbone, which is restructured into a residual network whose convolution layers use dilated (atrous) convolution. By introducing dilated convolution layers in the recognition stage, the method improves tomato recognition accuracy in complex environments and thereby the overall working efficiency of a tomato picking robot.

Description

Tomato identification method based on deep learning
Technical Field
The invention relates to a tomato identification method, in particular to a tomato identification method based on deep learning.
Background
China is the world's largest producer and consumer of tomatoes, but agricultural labor is in serious shortage. Tomato picking robots are therefore important for reducing production cost and improving picking efficiency, and their working efficiency is determined by the accuracy of identification and positioning. Tomato plants grow in varied forms: fruits overlap one another, leaves and branches occlude them, and illumination intensity varies. Studying the identification and positioning of tomatoes under such natural conditions is thus of great significance for improving the efficiency of picking robots.
Tomato identification and detection in the natural environment means recognizing target tomatoes in a complex scene with computer vision and transmitting the obtained position information to the manipulator of the tomato picking robot, so that the subsequent picking work can be carried out accurately.
Traditional tomato recognition methods for the greenhouse environment extract and classify features based on color or shape information, e.g. color histograms, threshold segmentation, or classification with support vector machines. These methods, however, ignore the environmental factors present under natural, complex conditions and can hardly meet practical requirements.
Convolutional neural networks based on deep learning provide a new approach to object recognition. Deep-learning detection methods fall into two classes: two-stage algorithms of the R-CNN family based on region proposals, and one-stage algorithms such as YOLO and SSD, which directly predict the classes and positions of different targets with a single CNN.
Deep-learning algorithms such as SSD and YOLO can identify tomatoes effectively; after learning from a large number of training samples, they can handle the complex natural conditions where traditional image algorithms fail.
In summary, an SSD network incorporating dilated convolution provides a novel method for tomato identification in the natural environment.
Disclosure of Invention
The invention aims to provide an SSD convolutional neural network incorporating dilated convolution that improves the target recognition rate of a tomato picking robot under complex conditions. A tomato identification and positioning method is provided to solve the difficulty of identifying target tomatoes in a complex environment.
The technical scheme adopted by the invention is as follows:
a tomato identification method based on deep learning comprises the following steps:
s1: collecting color images of tomatoes under outdoor natural illumination by using a color camera, and constructing a training sample;
s2: carrying out data enhancement on the color image acquired in the step S1 to form a training set;
s3: labeling the tomato position in each sample in the training set in the S2 to obtain a labeling file comprising real frame coordinate parameters;
s4: improving the SSD network architecture: replacing each of the 5 max pooling layers in the VGG16 backbone with dilated convolution layers configured for 2x downsampling, replacing the FC6 and FC7 layers of the SSD network with dilated convolution layers, and likewise replacing the conv8_2 and conv9_2 layers with dilated convolution layers, to obtain the improved SSD network architecture;
s5: training the improved SSD network architecture by using the training set, so that the tomato position in the color image can be identified;
s6: inputting the color image to be detected into the SSD network trained in step S5, and marking the bounding boxes whose scores exceed the threshold; the selected boxes are the detected tomato regions.
Preferably, in the step S1, tomato images under various complex conditions are collected, so that the training samples cover different illumination, viewing angles, fruit sizes, single and multiple fruits, and fruits occluded by tomato leaves.
Preferably, the data enhancement includes flipping, panning, cropping, color dithering, and noise enhancement.
Preferably, in S3, the tomato pictures in the training set are manually annotated with the labeling tool LabelImg, which produces the annotation files.
Preferably, in S4, the downsampling of the 5 max pooling layers in the original VGG16 network is replaced by 2x downsampling in 5 dilated convolution layers placed at layers 2, 3, 6, 9 and 14, forming 4 residual blocks.
Preferably, in S4, the two dilated convolution layers replacing FC6 and FC7 use a 3×3 convolution kernel with padding 0, stride 1, and 1 interval between the kernel points, giving a final receptive field of 7×7.
Preferably, in S4, the dilated convolution layer replacing the conv8_2 layer uses a 3×3 kernel with dilation rate 2, and the layer replacing the conv9_2 layer uses a 3×3 kernel with dilation rate 4.
Preferably, in the step S5, frame parameters are configured before training the improved SSD network architecture: the number of detection classes is set to 1, the matching threshold to 0.5, and the number of training steps to 20000.
According to the invention, all pooling layers are removed from the VGG backbone of the SSD architecture, which is changed into a residual network whose convolution layers use dilated convolution. In the SSD algorithm the low-level feature maps are large and the high-level feature maps small; downsampling reduces the size of the low-level feature maps, and replacing each max pooling layer with dilated convolution enlarges the receptive field of the feature map, so that each feature-map element captures more effective global information. Introducing dilated convolution layers in the recognition stage therefore improves tomato recognition accuracy in a complex environment and helps improve the overall working efficiency of the tomato picking robot.
Drawings
Fig. 1 is a flow chart of recognition of tomatoes based on deep learning.
Fig. 2 is a diagram of the modified SSD network framework.
Fig. 3 is a residual block network diagram.
FIG. 4 is a graph showing the detection results in one embodiment.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
Referring to fig. 1, which shows the basic flowchart of a tomato recognition method based on deep learning according to a preferred embodiment of the present invention, the method comprises the following steps:
step 1, image acquisition
A color camera is used to collect color images of tomatoes under outdoor natural illumination, constructing the training samples. When acquiring them, image samples should be gathered under conditions as varied as possible to enrich the sample types: at different times of day and illumination levels, with different viewing angles and fruit sizes, with single and multiple fruits, and with fruits occluded by tomato leaves.
Step 2, image data enhancement
The color images acquired in step 1 are augmented to form the training set. Data augmentation improves the robustness of the algorithm without reducing detection accuracy; since the tomato image data collected under natural illumination is often a small sample, the images are augmented by horizontal flipping, translation, color jitter, and noise addition.
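As an illustration only (not code from the patent), the flip, translation, and noise augmentations described above can be sketched in NumPy; the image size, shift amount, and noise level chosen here are arbitrary:

```python
import numpy as np

def augment(image, rng):
    """Produce simple augmented copies of an H x W x 3 image array:
    horizontal flip, a crude translation, and additive Gaussian noise."""
    flipped = image[:, ::-1, :]                          # horizontal flip
    shifted = np.roll(image, shift=10, axis=1)           # 10-px translation (wrap-around)
    noisy = np.clip(image + rng.normal(0.0, 8.0, image.shape), 0, 255)
    return flipped, shifted, noisy

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)
flipped, shifted, noisy = augment(img, rng)
```

In practice color jitter and cropping would also be applied, and the annotation boxes must be transformed consistently with the image (e.g. mirrored under a horizontal flip).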
Step 3, sample labeling
All enhanced image samples from step 2 are manually annotated with a 4-tuple (Xmin, Ymin, Xmax, Ymax) delimiting each region of interest (i.e. tomato position), where (Xmin, Ymin) is the upper-left corner and (Xmax, Ymax) the lower-right corner of the annotation box; this yields an annotation file containing the ground-truth box coordinates. When evaluating the accuracy of the model, the annotation file is used to score matches by the overlap rate between output boxes and annotation boxes. In this embodiment the tomato pictures in the training set are annotated manually with the labeling tool LabelImg.
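LabelImg writes annotations in Pascal VOC XML. As a sketch (the filename and coordinates below are illustrative, not from the patent's dataset), the 4-tuple can be recovered with the standard library:

```python
import xml.etree.ElementTree as ET

# A minimal Pascal VOC annotation of the kind LabelImg produces.
VOC_XML = """
<annotation>
  <filename>tomato_001.jpg</filename>
  <object>
    <name>tomato</name>
    <bndbox>
      <xmin>48</xmin><ymin>32</ymin><xmax>180</xmax><ymax>170</ymax>
    </bndbox>
  </object>
</annotation>
"""

def parse_boxes(xml_text):
    """Return all (xmin, ymin, xmax, ymax) ground-truth boxes in the file."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append(tuple(int(bb.find(tag).text)
                           for tag in ("xmin", "ymin", "xmax", "ymax")))
    return boxes

boxes = parse_boxes(VOC_XML)
```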
Step 4, improved SSD network architecture construction
The conventional SSD object detection framework is based on a feedforward neural network; its structure is described in the prior art in Liu W., Anguelov D., Erhan D., et al., "SSD: Single Shot MultiBox Detector," European Conference on Computer Vision, Springer International Publishing, 2016.
The SSD framework uses a fully convolutional VGG16 as the base network and predicts multiple object categories and their bounding boxes directly from feature maps. In the present invention, however, the SSD network architecture is modified so that it can accurately identify tomatoes in an image. In this embodiment the basic structure of the SSD network is unchanged; only some of its layers are replaced by dilated convolution layers, as follows:
in one aspect, 5 layers of the max-pooling layer in the VGG16 network are replaced with the hole convolution layers, respectively, and the hole convolution layers are set to 2 times downsampling. On the other hand, the FC6 layer and the FC7 layer in the SSD network are replaced with the hole convolution layers, respectively, while the conv8_2 layer and the conv9_2 layer are replaced with the hole convolution layers, respectively. After the replacement, an improved SSD network architecture is obtained, which can be seen in fig. 2.
The specific parameters of the dilated convolution layers can be adjusted as needed. In this embodiment, the dilated convolution layers replacing the 5 max pooling layers of the VGG16 network are placed at layers 2, 3, 6, 9 and 14. In the improved SSD framework the 5 max pooling layers are removed and their downsampling is replaced by 2x downsampling in these dilated convolution layers, forming a deep residual network with 4 residual blocks whose "building block" structure reduces the number of convolution parameters to be computed. The residual network structure is shown in fig. 3.
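To make the pooling-to-convolution swap concrete, here is a naive single-channel NumPy sketch of dilated convolution (illustrative only, not the patent's network code; a real network would use a framework layer such as a strided, dilated Conv2d). With stride 2 it downsamples by 2x, as the replaced pooling layers did:

```python
import numpy as np

def dilated_conv2d(x, kernel, stride=1, dilation=1, padding=0):
    """Naive 2-D dilated convolution on a single-channel square array."""
    if padding:
        x = np.pad(x, padding)
    k = kernel.shape[0]
    k_eff = k + (k - 1) * (dilation - 1)      # effective kernel extent
    out = (x.shape[0] - k_eff) // stride + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Sample input points spaced `dilation` apart within the window.
            patch = x[i*stride : i*stride + k_eff : dilation,
                      j*stride : j*stride + k_eff : dilation]
            y[i, j] = (patch * kernel).sum()
    return y

x = np.arange(64, dtype=float).reshape(8, 8)
k = np.ones((3, 3)) / 9.0
y = dilated_conv2d(x, k, stride=2, dilation=1, padding=1)   # 8x8 -> 4x4
y2 = dilated_conv2d(x, k, stride=1, dilation=2)             # 3x3 kernel spans 5x5
```

Unlike max pooling, the downsampling here has learnable weights, which is the motivation for the replacement described above.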
In addition, the two dilated convolution layers replacing FC6 and FC7 use a 3×3 kernel with padding 0, stride 1, and 1 interval between the kernel points, so the final receptive field is 7×7. The layer replacing conv8_2 uses a 3×3 kernel with dilation rate 2; the layer replacing conv9_2 uses a 3×3 kernel with dilation rate 4.
In the invention, the SSD network combined with dilated convolution avoids the information loss of pooling operations and, at the same computational cost, enlarges the receptive field, so that each convolution output covers a larger range of the input. The size o of the feature map obtained after a dilated convolution is

o = ⌊(i + 2p - k - (k - 1)(d - 1)) / s⌋ + 1

where i is the size of the input feature map; p is the padding; k is the convolution kernel size; d is the dilation factor (the convolution samples input elements spaced d - 1 apart); and s is the stride.
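The output-size relation can be checked numerically; this small helper (illustrative, not from the patent) implements it directly:

```python
def dilated_out_size(i, k, d=1, p=0, s=1):
    """Feature-map size after a dilated convolution:
    o = floor((i + 2p - k - (k-1)(d-1)) / s) + 1."""
    return (i + 2 * p - k - (k - 1) * (d - 1)) // s + 1

# A 3x3 kernel with dilation 2 covers a 5x5 extent, so a 7x7 input
# shrinks to 3x3 without padding.
size = dilated_out_size(7, k=3, d=2)
```

With padding 2, the same dilated 3×3 kernel preserves spatial size (e.g. 38 -> 38), which is why dilation enlarges the receptive field without shrinking the feature map.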
In the improved SSD network, the loss function has essentially the same form as in a conventional SSD network. A fixed-size set of bounding boxes, together with the confidence for each object class in those boxes, is produced on the multi-layer feature maps:

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))      (1)

where c is the per-class confidence from the Softmax function; N is the number of matched default boxes; the weight term α is set to 1 by cross-validation; x is the default-box matching indicator; l is the predicted box; and g is the ground-truth box. L_conf is the confidence loss and L_loc the position loss.
The position loss is the smooth L1 loss between the predicted box l and the ground-truth box g, as shown in formula (2):

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_ij^k · smooth_L1(l_i^m - ĝ_j^m)      (2)

where Pos is the set of matched default boxes; m ranges over the center coordinates (cx, cy) and the width and height (w, h) of the predicted box; and x_ij^k is the matching value of the i-th default box to the j-th ground-truth box of class k.
In this embodiment, the SSD network extracts 4 or 6 default boxes per cell, according to different aspect ratios, from the conv4_3, fc7, conv8_2, conv9_2, conv10_2 and conv11_2 feature layers, for a total of 8732 default boxes.
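The 8732 figure can be reproduced from the standard SSD300 design; the feature-map sizes below (38, 19, 10, 5, 3, 1) come from the original SSD paper, not from the patent text, so they are an assumption about this embodiment:

```python
# (feature-map side length, default boxes per cell) for
# conv4_3, fc7, conv8_2, conv9_2, conv10_2, conv11_2 in SSD300.
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(side * side * boxes for side, boxes in layers)
```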
Step 5, model training
After the improved SSD network architecture is built, it is trained with the training set and the annotation files obtained in the previous steps, so that the tomato position in an input color image can be identified. Frame parameters must be configured before training; in this embodiment only tomatoes need to be detected, so the number of detection classes is set to 1, the matching threshold to 0.5, and the number of training steps to 20000.
In the training stage, formula (1) is used for regression of the coordinate offsets of the predicted boxes; in the testing stage, each default box is matched to the annotation boxes by overlap rate, and the matching scores are ranked from high to low.
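The overlap rate used for matching is the intersection-over-union (IoU) of two boxes; a box pair counts as a match when IoU exceeds the 0.5 threshold configured above. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 10, 10), (5, 5, 15, 15))   # 25 / 175
```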
Because the low-level feature maps used by the SSD algorithm are large and the high-level feature maps small, downsampling reduces the size of the low-level feature maps; replacing the max pooling layers with dilated convolution enlarges the receptive field of the feature maps, so that each feature-map element obtains more effective global information.
Step 6, tomato identification
The color image to be detected is input into the trained SSD network, bounding boxes with a score above the 0.5 threshold are marked, and the selected boxes are the tomato regions. In this embodiment, when the test set is input into the network model, the detection results are clearly better than those of the conventional SSD model; fig. 4 shows the tomato bounding boxes detected in one of the pictures, where the tomato positions are on the whole accurately identified.
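The final filtering step amounts to keeping detections above the confidence threshold, highest score first (the boxes and scores below are made-up illustrations, not results from the patent):

```python
THRESHOLD = 0.5   # confidence threshold from the detection step above

def filter_detections(detections, threshold=THRESHOLD):
    """Keep boxes whose confidence exceeds the threshold, highest first."""
    kept = [d for d in detections if d["score"] > threshold]
    return sorted(kept, key=lambda d: d["score"], reverse=True)

dets = [{"box": (10, 10, 60, 70), "score": 0.93},
        {"box": (80, 15, 120, 55), "score": 0.41},
        {"box": (30, 90, 95, 150), "score": 0.77}]
kept = filter_detections(dets)
```

A full pipeline would also apply non-maximum suppression so that overlapping boxes on the same fruit collapse to one detection.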
Thus, using convolution layers instead of max pooling layers for downsampling retains more feature information, and introducing a residual structure into the backbone enables feature reuse and fusion between front and rear layers, satisfying both speed and accuracy in extracting tomato fruit features.
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.

Claims (6)

1. The tomato identification method based on deep learning is characterized by comprising the following steps:
s1: collecting color images of tomatoes under outdoor natural illumination by using a color camera, and constructing a training sample;
s2: carrying out data enhancement on the color image acquired in the step S1 to form a training set;
s3: labeling the tomato position in each sample in the training set in the S2 to obtain a labeling file comprising real frame coordinate parameters;
s4: improving the SSD network architecture: replacing each of the 5 max pooling layers in the VGG16 backbone with dilated convolution layers configured for 2x downsampling, replacing the FC6 and FC7 layers of the SSD network with dilated convolution layers, and likewise replacing the conv8_2 and conv9_2 layers with dilated convolution layers, to obtain the improved SSD network architecture;
s5: training the improved SSD network architecture by using the training set, so that the tomato position in the color image can be identified;
s6: inputting the color image to be detected into the SSD network trained in the step S5, marking a boundary box with a score larger than a threshold value, wherein a box selection area is a detected tomato area;
in the step S4, the downsampling of the 5 max pooling layers in the original VGG16 network is replaced by 2x downsampling in 5 dilated convolution layers placed at layers 2, 3, 6, 9 and 14, forming 4 residual blocks;
in the step S4, the dilated convolution layer replacing the conv8_2 layer uses a 3×3 kernel with dilation rate 2, and the layer replacing the conv9_2 layer uses a 3×3 kernel with dilation rate 4.
2. The deep learning-based tomato recognition method as claimed in claim 1, wherein in S1 tomato images under various complex conditions are acquired, the training samples covering different illumination, angles, fruit sizes, single and multiple fruits, and fruits occluded by tomato leaves.
3. A deep learning based tomato recognition method as claimed in claim 1, wherein the data enhancement includes flipping, panning, clipping, color dithering and noise enhancement.
4. The deep learning-based tomato recognition method as claimed in claim 1, wherein in S3 the tomato pictures in the training set are manually annotated with the labeling tool LabelImg, which produces the annotation files.
5. The deep learning-based tomato recognition method as claimed in claim 1, wherein in S4 the two dilated convolution layers replacing FC6 and FC7 use a 3×3 convolution kernel with padding 0, stride 1, and 1 interval between the kernel points, and the final receptive field is 7×7.
6. The deep learning-based tomato recognition method as claimed in claim 1, wherein in the step S5 frame parameters are configured before training the improved SSD network architecture: the number of detection classes is set to 1, the matching threshold to 0.5, and the number of training steps to 20000.
CN202011169184.6A 2020-10-28 2020-10-28 Tomato identification method based on deep learning Active CN112232263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011169184.6A CN112232263B (en) 2020-10-28 2020-10-28 Tomato identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011169184.6A CN112232263B (en) 2020-10-28 2020-10-28 Tomato identification method based on deep learning

Publications (2)

Publication Number | Publication Date
CN112232263A (en) | 2021-01-15
CN112232263B (en) | 2024-03-19

Family

ID=74109139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011169184.6A Active CN112232263B (en) 2020-10-28 2020-10-28 Tomato identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN112232263B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836623B (en) * 2021-01-29 2024-04-16 北京农业智能装备技术研究中心 Auxiliary method and device for agricultural decision of facility tomatoes
CN113076819A (en) * 2021-03-17 2021-07-06 山东师范大学 Fruit identification method and device under homochromatic background and fruit picking robot
CN113326808A (en) * 2021-06-26 2021-08-31 西北农林科技大学 Night tomato identification system and method based on improved yolo
CN115424247B (en) * 2022-06-24 2023-04-25 中国农业科学院农业信息研究所 Greenhouse tomato identification and detection method adopting CBAM and octave convolution to improve YOLOV5
CN116071423A (en) * 2023-02-09 2023-05-05 哈尔滨市科佳通用机电股份有限公司 Method, system and medium for positioning railway wagon brake adjuster component

Citations (6)

Publication number Priority date Publication date Assignee Title
CN109684967A (en) * 2018-12-17 2019-04-26 东北农业大学 A kind of soybean plant strain stem pod recognition methods based on SSD convolutional network
CN109784190A (en) * 2018-12-19 2019-05-21 华东理工大学 A kind of automatic Pilot scene common-denominator target Detection and Extraction method based on deep learning
CN110991511A (en) * 2019-11-26 2020-04-10 中原工学院 Sunflower crop seed sorting method based on deep convolutional neural network
CN111079822A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Method for identifying dislocation fault image of middle rubber and upper and lower plates of axle box rubber pad
CN111626993A (en) * 2020-05-07 2020-09-04 武汉科技大学 Image automatic detection counting method and system based on embedded FEFnet network
CN111652869A (en) * 2020-06-02 2020-09-11 中冶赛迪重庆信息技术有限公司 Slab void identification method, system, medium and terminal based on deep learning


Non-Patent Citations (2)

Title
Multi-Scale Context Aggregation by Dilated Convolutions; Fisher Yu et al.; arXiv; full text *
Application of an improved SSD convolutional neural network to steel plate edge crack detection; Liu Jidan et al.; Metallurgical Automation; Vol. 44, No. 4; pp. 43-47, 80 *

Also Published As

Publication number Publication date
CN112232263A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN112232263B (en) Tomato identification method based on deep learning
Jia et al. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot
CN111126325B (en) Intelligent personnel security identification statistical method based on video
CN109800628A (en) A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance
CN111275082A (en) Indoor object target detection method based on improved end-to-end neural network
CN110298291A (en) Ox face and ox face critical point detection method based on Mask-RCNN
CN110853015A (en) Aluminum profile defect detection method based on improved Faster-RCNN
CN111340141A (en) Crop seedling and weed detection method and system based on deep learning
CN111862119A (en) Semantic information extraction method based on Mask-RCNN
CN110927171A (en) Bearing roller chamfer surface defect detection method based on machine vision
CN111563452A (en) Multi-human body posture detection and state discrimination method based on example segmentation
CN112464766A (en) Farmland automatic identification method and system
CN113191334B (en) Plant canopy dense leaf counting method based on improved CenterNet
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN111178177A (en) Cucumber disease identification method based on convolutional neural network
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
CN115035082B (en) Method for detecting defects of transparent parts of aircraft based on YOLOv4 improved algorithm
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
CN112465820A (en) Semantic segmentation based rice disease detection method integrating global context information
CN113610035A (en) Rice tillering stage weed segmentation and identification method based on improved coding and decoding network
CN112396655A (en) Point cloud data-based ship target 6D pose estimation method
CN115205626A (en) Data enhancement method applied to field of coating defect detection
CN109615610B (en) Medical band-aid flaw detection method based on YOLO v2-tiny
CN116524344A (en) Tomato string picking point detection method based on RGB-D information fusion
CN112686872A (en) Wood counting method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant