CN112232263A - Tomato identification method based on deep learning - Google Patents

Tomato identification method based on deep learning Download PDF

Info

Publication number
CN112232263A
CN112232263A (application CN202011169184.6A); granted as CN112232263B
Authority
CN
China
Prior art keywords
tomato
layer
layers
training
hole
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011169184.6A
Other languages
Chinese (zh)
Other versions
CN112232263B (en
Inventor
梁喜凤
顾鹏程
赵力勤
余文胜
孙立峰
徐学珍
谢文兵
王永维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Jiliang University
Original Assignee
China Jiliang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Jiliang University filed Critical China Jiliang University
Priority to CN202011169184.6A priority Critical patent/CN112232263B/en
Publication of CN112232263A publication Critical patent/CN112232263A/en
Application granted granted Critical
Publication of CN112232263B publication Critical patent/CN112232263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/10: Scenes; scene-specific elements; terrestrial scenes
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06V 10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]


Abstract

The invention discloses a tomato identification method based on deep learning. First, images of tomatoes are collected under natural conditions and augmented to enlarge the data set; the target tomatoes in all images are then manually annotated, and the image data are divided into a training set and a validation set. All pooling layers are removed from the VGG network framework, which is converted into a residual network whose convolutional layers use hole (dilated) convolution. By introducing hole convolutional layers at the image-recognition stage, the method improves tomato recognition accuracy in complex environments and helps raise the overall working efficiency of a tomato-picking robot.

Description

Tomato identification method based on deep learning
Technical Field
The invention relates to a tomato identification method, in particular to a tomato identification method based on deep learning.
Background
China is the world's largest producer and consumer of tomatoes, yet agricultural labor is in serious shortage and the population structure is aging, so tomato-picking robots are of great significance for reducing production costs and improving picking efficiency. The accuracy of identification and positioning determines the working efficiency of a tomato-picking robot. Tomato fruits vary in growth form, fruits overlap one another, illumination intensity varies, and fruits may be occluded by tomato leaves and branches, so research on identifying and locating tomatoes under natural conditions is important for improving the efficiency of picking robots.
Tomato identification and detection in a natural environment means using computer vision to recognize target tomatoes in a complex environment and transmitting the resulting position information to the manipulator of a tomato-picking robot, so that subsequent picking can be carried out accurately.
Traditional tomato identification methods in greenhouse environments extract and classify based on color or shape features, including color histograms, threshold segmentation, and support-vector-machine classifiers; however, these methods do not account for the environmental factors of complex natural conditions and have difficulty meeting practical requirements.
Convolutional neural networks based on deep learning provide a new approach to object recognition. Deep-learning detection methods can be divided into two types: region-proposal-based R-CNN algorithms, and one-stage algorithms such as YOLO and SSD, which directly predict the classes and positions of different targets with a single CNN.
Deep-learning algorithms such as SSD and YOLO can identify tomatoes effectively; after training on a large number of samples, they generalize well and can recognize tomatoes under complex natural conditions where traditional image algorithms fail.
In summary, the SSD network fused with hole convolution provides a new method for identifying tomatoes in a natural environment.
Disclosure of Invention
The invention aims to provide an SSD convolutional neural network fused with hole (dilated) convolution to improve the target recognition rate of a tomato-picking robot under complex conditions. A method for identifying and locating tomatoes is provided to address the difficulty of recognizing target tomatoes in a complex environment.
The technical scheme adopted by the invention is as follows:
a tomato identification method based on deep learning comprises the following steps:
S1: collecting color images of tomatoes under outdoor natural illumination with a color camera to construct training samples;
S2: performing data enhancement on the color images acquired in S1 to form a training set;
S3: marking the position of the tomato in each sample of the training set from S2 to obtain an annotation file containing the ground-truth box coordinate parameters;
S4: improving the SSD network architecture, namely replacing each of the 5 max-pooling layers in the VGG16 network with a hole convolutional layer configured for 2-fold downsampling, replacing the FC6 and FC7 layers in the SSD network with hole convolutional layers, and replacing the conv8_2 and conv9_2 layers with hole convolutional layers, to obtain the improved SSD network architecture;
S5: training the improved SSD network architecture with the training set so that it can identify the position of the tomato in a color image;
S6: inputting the color image to be detected into the SSD network trained in S5, labeling the bounding boxes whose scores exceed the threshold, and selecting those areas as the detected tomato regions.
Preferably, in S1, tomato images are collected under a variety of complex conditions, and the training samples cover different illumination, angles and fruit sizes, single fruits, multiple fruits, and fruits occluded by tomato leaves.
Preferably, the data enhancement includes flipping, shifting, cropping, color dithering, and noise enhancement.
Preferably, in S3, the tomato pictures in the training set are manually annotated into label files using the LabelImg toolkit.
Preferably, in S4, the hole convolutional layers at layers 2, 3, 6, 9 and 14 each perform 2-fold downsampling, replacing the downsampling of the 5 max-pooling layers in the original VGG16 network and forming 4 residual blocks.
Preferably, in S4, the convolution kernel size of the two hole convolutional layers replacing FC6 and FC7 is 3 × 3, the padding is set to 0, the stride is set to 1, the spacing between convolution kernel points is 1, and the final receptive field size is 7 × 7.
Preferably, in S4, the convolution kernel size of the hole convolutional layer replacing the conv8_2 layer is 3 × 3 with a hole rate of 2, and that of the hole convolutional layer replacing the conv9_2 layer is 3 × 3 with a hole rate of 4.
Preferably, in S5, framework parameters are configured in advance when training the improved SSD network architecture: the number of detection classes is set to 1, the matching threshold is 0.5, and the number of training steps is 20000.
The invention removes all pooling layers from the VGG framework in the SSD network architecture, converts it into a residual network, and uses hole convolution in the convolutional layers of that residual network. Because the low-level feature maps used in the SSD algorithm are large while the high-level feature maps are small, downsampling is used to reduce the size of the low-level feature maps, and hole convolution replaces the max-pooling layers, which enlarges the receptive field of the feature maps so that feature-map elements capture more effective global information. Introducing hole convolutional layers at the image-recognition stage therefore improves tomato recognition accuracy in complex environments and helps raise the overall working efficiency of the tomato-picking robot.
Drawings
Fig. 1 is a flow chart of tomato recognition based on deep learning.
Fig. 2 is a modified diagram of the SSD network framework.
Fig. 3 is a diagram of a residual block network.
FIG. 4 shows the result of the detection in one embodiment.
Detailed Description
The invention is further illustrated by the following figures and specific examples.
Fig. 1 is a basic operation flowchart of a tomato recognition method based on deep learning according to a preferred embodiment of the present invention, which includes the following steps:
step 1, image acquisition
Color images of tomatoes under outdoor natural illumination are collected with a color camera to construct training samples. When acquiring training samples, image samples under as many different conditions as possible should be selected under natural lighting to enrich the sample types: images can be captured at different times of day and under different illumination levels, covering varied illumination, angles and fruit sizes, single fruits, multiple fruits, fruits occluded by tomato leaves, and so on.
Step 2, image data enhancement
The color images collected in step 1 are augmented to form the training set. Data enhancement improves algorithm robustness without reducing detection accuracy; since tomato image data collected under natural illumination often form too small a sample, the images are enhanced with methods including horizontal flipping, translation, color jittering, and noise addition.
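The patent gives no code for the augmentation step. As a minimal illustration (the helper below is hypothetical, not from the patent), a horizontal flip together with the matching bounding-box update can be sketched in plain Python:

```python
def hflip_with_box(image_rows, box, width):
    """Horizontally flip an image (given as a list of pixel rows) and
    remap an (xmin, ymin, xmax, ymax) box into the flipped frame."""
    flipped = [list(reversed(row)) for row in image_rows]
    xmin, ymin, xmax, ymax = box
    # After flipping, the old right edge becomes the new left edge.
    new_box = (width - xmax, ymin, width - xmin, ymax)
    return flipped, new_box
```

Translation and noise addition would need analogous coordinate updates (or none, for pure pixel noise); the key point is that every geometric augmentation must transform the annotation boxes consistently with the pixels.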
Step 3, labeling samples
All the enhanced image samples from step 2 are manually annotated with a 4-tuple (Xmin, Ymin, Xmax, Ymax) describing the region of interest (i.e. the position of the tomato), the pairs (Xmin, Ymax) and (Xmax, Ymin) representing the upper-left and lower-right corners of the annotation box, yielding an annotation file containing the ground-truth box coordinate parameters. When the label file is used to evaluate model accuracy, the matching score is judged by the overlap rate between the output box and the labeled box. In this embodiment, the tomato pictures in the training set are manually annotated into label files with the LabelImg toolkit.
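The overlap rate mentioned above is conventionally computed as intersection-over-union (IoU). A minimal sketch for the (Xmin, Ymin, Xmax, Ymax) boxes used here (the function name is an assumption, not from the patent):

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An output box matching a labeled box with IoU above the 0.5 matching threshold used later in training would count as a correct detection.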
Step 4, improved SSD network architecture construction
The traditional SSD object-detection framework is based on a feed-forward neural network; its structure is described in Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector [C]// European Conference on Computer Vision, 2016 (prior art).
The SSD framework uses a fully convolutional VGG16 as the base network and predicts multi-target classes and bounding boxes directly from the feature maps. In the invention, however, the SSD network architecture must be improved so that it can accurately identify tomatoes in an image. In this embodiment, the basic structure of the SSD network is unchanged; only some layers are replaced with hole convolutional layers, as follows:
in one aspect, the 5 largest pooling layers in the VGG16 network are each replaced with a hole convolutional layer, and the hole convolutional layers are set to 2-fold down-sampling. On the other hand, the FC6 layer and the FC7 layer in the SSD network are replaced with the hole convolutional layers, respectively, and the conv8_2 layer and the conv9_2 layer are replaced with the hole convolutional layers, respectively. After the above replacement, an improved SSD network architecture is obtained, and the network architecture can be seen in fig. 2.
The specific parameters of the hole convolutional layers can be adjusted as desired. In this embodiment, the hole convolutional layers replacing the 5 max-pooling layers sit at layers 2, 3, 6, 9 and 14 of the VGG16 network, in that order. Thus, in the improved SSD framework, the 5 max-pooling layers are removed and these five hole convolutional layers perform the 2-fold downsampling that the max-pooling layers performed in the original VGG16 network, forming a deep residual network with 4 residual blocks whose structure reduces the number of convolution parameters to be computed. The residual network structure can be seen in fig. 3.
In addition, the convolution kernel size of the two hole convolutional layers replacing FC6 and FC7 is 3 × 3, the padding is set to 0, the stride is set to 1, and the spacing between convolution kernel points is set to 1, so the final receptive field size is 7 × 7. The convolution kernel size of the hole convolutional layer replacing the conv8_2 layer is 3 × 3 with a hole rate of 2; that of the hole convolutional layer replacing the conv9_2 layer is 3 × 3 with a hole rate of 4.
By adopting an SSD network combined with hole convolution, the invention avoids the information loss caused by pooling and enlarges the receptive field at the same computational cost, so that each convolution output contains information from a larger range. The size of the feature map obtained after hole convolution is calculated as follows:
o = floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1
where o is the size of the output feature map; i is the size of the input feature map; p is the padding value; k is the convolution kernel size; d is the dilation factor (the hole convolution is applied to input elements spaced d - 1 apart); and s is the stride.
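The feature-map size formula above can be checked numerically; a small helper (hypothetical, not part of the patent) implementing it:

```python
import math

def dilated_conv_out_size(i, k, p, s, d):
    """Output size of a dilated (hole) convolution along one dimension:
    o = floor((i + 2p - k - (k-1)(d-1)) / s) + 1."""
    effective_k = k + (k - 1) * (d - 1)  # kernel span after dilation
    return math.floor((i + 2 * p - effective_k) / s) + 1
```

For example, a 3 × 3 kernel with dilation 3 spans 7 input elements, consistent with the 7 × 7 receptive field cited for the FC6/FC7 replacements, and a stride-2 3 × 3 layer with padding 1 halves a 300-wide input, matching the 2-fold downsampling that replaces max-pooling.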
In the improved SSD network, the loss function takes essentially the same form as in the conventional SSD network. A fixed-size set of bounding boxes, together with confidence scores for the object classes in those boxes, is generated from the multi-level feature maps:
L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))    (1)
where c is the Softmax confidence for each category; N is the number of matched default boxes; the weight term α is set to 1 by cross-validation; x is the default-box matching indicator; l is the prediction box; g is the ground-truth value; L_conf is the confidence loss; and L_loc is the localization loss.
The localization loss is the smooth L1 loss between the prediction box (l) and the ground-truth box (g), as shown in equation (2):
L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1(l_i^m - ĝ_j^m)    (2)
In this formula, Pos is the set of matched (positive) default boxes; (cx, cy, w, h) are the center coordinates, width and height of the prediction box; and x_ij^k is the matching indicator between the i-th default box and the j-th ground-truth box of category k.
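The smooth L1 term in equation (2) is simple enough to sketch directly; the helper names below are assumptions for illustration, not from the patent:

```python
def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def loc_loss(preds, targets, matches):
    """Localization loss summed over matched pairs: preds/targets map a
    box index to its (cx, cy, w, h) offsets, matches lists (i, j) pairs
    of matched default and ground-truth boxes."""
    total = 0.0
    for i, j in matches:
        total += sum(smooth_l1(p - g) for p, g in zip(preds[i], targets[j]))
    return total
```

The quadratic region near zero keeps gradients small for nearly correct boxes, while the linear region limits the influence of large coordinate errors.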
In this embodiment, the SSD network extracts 4 or 6 default boxes per cell, depending on the aspect-ratio settings, from each of the Conv4_3, FC7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 feature layers, finally obtaining 8732 default boxes.
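The figure of 8732 default boxes follows from the per-layer feature-map sizes of a 300 × 300 SSD; the sizes below are assumed from the standard SSD configuration, not stated explicitly in the patent:

```python
# Per-layer feature-map side length and default boxes per cell in a
# standard SSD300 head (assumed values, matching the original SSD setup).
layers = {
    "Conv4_3":  (38, 4),
    "FC7":      (19, 6),
    "Conv8_2":  (10, 6),
    "Conv9_2":  (5, 6),
    "Conv10_2": (3, 4),
    "Conv11_2": (1, 4),
}
total = sum(size * size * boxes for size, boxes in layers.values())
print(total)  # 8732
```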
Step 5, model training
After the improved SSD network architecture is constructed, it can be trained with the training set and label files obtained in the previous steps, so that it can identify the positions of tomatoes in an input color image. Framework parameters must be configured before training; since only tomatoes need to be detected in this embodiment, the number of detection classes is set to 1, the matching threshold is 0.5, and the number of training steps is 20000.
In the training stage, equation (1) is used to regress the coordinate deviations of the prediction boxes; in the testing stage, each default box is matched to the labeled boxes by overlap rate, and the default boxes are sorted by matching score from high to low.
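Sorting candidates by score and discarding redundant overlapping boxes is typically done with greedy non-maximum suppression; the patent does not spell this step out, so the following is only a standard sketch under that assumption:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: visit boxes in descending score
    order and keep one only if it overlaps every kept box by < iou_thresh.
    Returns the indices of the kept boxes."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept
```

With overlapping fruit, as in the occlusion cases described in step 1, the IoU threshold trades off between suppressing duplicate detections of one tomato and keeping two genuinely adjacent tomatoes.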
Because the low-level feature maps used in the SSD algorithm are large while the high-level feature maps are small, downsampling is used to reduce the size of the low-level feature maps, and hole convolution replaces the max-pooling layers, enlarging the receptive field of the feature maps so that feature-map elements capture more effective global information.
Step 6, identifying tomatoes
The color image to be detected is input into the trained SSD network; bounding boxes whose scores exceed the 0.5 threshold are labeled, and those areas are selected as the detected tomato regions. In this embodiment, when the test set is input into the network model, the detection results are significantly better than those of the conventional SSD network model; fig. 4 shows the detected tomato bounding boxes in one of the pictures, where the positions of the tomatoes are accurately identified overall.
Thus, using convolutional layers instead of max-pooling layers for downsampling retains more feature information, and introducing a residual structure into the backbone network enables reuse and fusion of features across front and rear layers, meeting the requirements of fast and accurate tomato fruit feature extraction.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (8)

1. A tomato identification method based on deep learning, characterized by comprising the following steps:
S1: collecting color images of tomatoes under outdoor natural illumination with a color camera to construct training samples;
S2: performing data enhancement on the color images acquired in S1 to form a training set;
S3: marking the position of the tomato in each sample of the training set from S2 to obtain an annotation file containing the ground-truth box coordinate parameters;
S4: improving the SSD network architecture, namely replacing each of the 5 max-pooling layers in the VGG16 network with a hole convolutional layer configured for 2-fold downsampling, replacing the FC6 and FC7 layers in the SSD network with hole convolutional layers, and replacing the conv8_2 and conv9_2 layers with hole convolutional layers, to obtain the improved SSD network architecture;
S5: training the improved SSD network architecture with the training set so that it can identify the position of the tomato in a color image;
S6: inputting the color image to be detected into the SSD network trained in S5, labeling the bounding boxes whose scores exceed the threshold, and selecting those areas as the detected tomato regions.
2. The tomato identification method based on deep learning as claimed in claim 1, wherein in S1, tomato images are collected under a variety of complex conditions, and the training samples cover different illumination, angles and fruit sizes, single fruits, multiple fruits, and fruits occluded by tomato leaves.
3. The tomato identification method based on deep learning as claimed in claim 1, wherein the data enhancement comprises flipping, shifting, cropping, color jittering and noise addition.
4. The tomato identification method based on deep learning as claimed in claim 1, wherein in S3, the tomato pictures in the training set are manually annotated into label files using the LabelImg toolkit.
5. The tomato identification method based on deep learning as claimed in claim 1, wherein in S4, the hole convolutional layers at layers 2, 3, 6, 9 and 14 each perform 2-fold downsampling, replacing the downsampling of the 5 max-pooling layers in the original VGG16 network and forming 4 residual blocks.
6. The tomato identification method based on deep learning as claimed in claim 1, wherein in S4, the convolution kernel size of the two hole convolutional layers replacing FC6 and FC7 is 3 × 3, the padding is set to 0, the stride is set to 1, the spacing between convolution kernel points is 1, and the final receptive field size is 7 × 7.
7. The tomato identification method based on deep learning as claimed in claim 1, wherein in S4, the convolution kernel size of the hole convolutional layer replacing the conv8_2 layer is 3 × 3 with a hole rate of 2, and that of the hole convolutional layer replacing the conv9_2 layer is 3 × 3 with a hole rate of 4.
8. The tomato identification method based on deep learning as claimed in claim 1, wherein in S5, framework parameters are configured in advance when training the improved SSD network architecture: the number of detection classes is set to 1, the matching threshold is 0.5, and the number of training steps is 20000.
CN202011169184.6A 2020-10-28 2020-10-28 Tomato identification method based on deep learning Active CN112232263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011169184.6A CN112232263B (en) 2020-10-28 2020-10-28 Tomato identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011169184.6A CN112232263B (en) 2020-10-28 2020-10-28 Tomato identification method based on deep learning

Publications (2)

Publication Number Publication Date
CN112232263A true CN112232263A (en) 2021-01-15
CN112232263B CN112232263B (en) 2024-03-19

Family

ID=74109139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011169184.6A Active CN112232263B (en) 2020-10-28 2020-10-28 Tomato identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN112232263B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836623A (en) * 2021-01-29 2021-05-25 北京农业智能装备技术研究中心 Facility tomato farming decision auxiliary method and device
CN113076819A (en) * 2021-03-17 2021-07-06 山东师范大学 Fruit identification method and device under homochromatic background and fruit picking robot
CN113326808A (en) * 2021-06-26 2021-08-31 西北农林科技大学 Night tomato identification system and method based on improved yolo
CN115424247A (en) * 2022-06-24 2022-12-02 中国农业科学院农业信息研究所 Greenhouse tomato identification and detection method adopting CBAM and octave convolution to improve YOLOV5
CN116071423A (en) * 2023-02-09 2023-05-05 哈尔滨市科佳通用机电股份有限公司 Method, system and medium for positioning railway wagon brake adjuster component

Citations (6)

Publication number Priority date Publication date Assignee Title
CN109684967A (en) * 2018-12-17 2019-04-26 东北农业大学 A kind of soybean plant strain stem pod recognition methods based on SSD convolutional network
CN109784190A (en) * 2018-12-19 2019-05-21 华东理工大学 A kind of automatic Pilot scene common-denominator target Detection and Extraction method based on deep learning
CN110991511A (en) * 2019-11-26 2020-04-10 中原工学院 Sunflower crop seed sorting method based on deep convolutional neural network
CN111079822A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Method for identifying dislocation fault image of middle rubber and upper and lower plates of axle box rubber pad
CN111626993A (en) * 2020-05-07 2020-09-04 武汉科技大学 Image automatic detection counting method and system based on embedded FEFnet network
CN111652869A (en) * 2020-06-02 2020-09-11 中冶赛迪重庆信息技术有限公司 Slab void identification method, system, medium and terminal based on deep learning

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN109684967A (en) * 2018-12-17 2019-04-26 东北农业大学 A kind of soybean plant strain stem pod recognition methods based on SSD convolutional network
CN109784190A (en) * 2018-12-19 2019-05-21 华东理工大学 A kind of automatic Pilot scene common-denominator target Detection and Extraction method based on deep learning
CN110991511A (en) * 2019-11-26 2020-04-10 中原工学院 Sunflower crop seed sorting method based on deep convolutional neural network
CN111079822A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Method for identifying dislocation fault image of middle rubber and upper and lower plates of axle box rubber pad
CN111626993A (en) * 2020-05-07 2020-09-04 武汉科技大学 Image automatic detection counting method and system based on embedded FEFnet network
CN111652869A (en) * 2020-06-02 2020-09-11 中冶赛迪重庆信息技术有限公司 Slab void identification method, system, medium and terminal based on deep learning

Non-Patent Citations (2)

Title
FISHER YU 等: "MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS", ARXIV *
刘继丹 et al.: "Application of an improved SSD convolutional neural network in steel-plate edge-crack detection", 冶金自动化 (Metallurgical Industry Automation), vol. 44, no. 4, pp. 43-47 *

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN112836623A (en) * 2021-01-29 2021-05-25 北京农业智能装备技术研究中心 Facility tomato farming decision auxiliary method and device
CN112836623B (en) * 2021-01-29 2024-04-16 北京农业智能装备技术研究中心 Auxiliary method and device for agricultural decision of facility tomatoes
CN113076819A (en) * 2021-03-17 2021-07-06 山东师范大学 Fruit identification method and device under homochromatic background and fruit picking robot
CN113326808A (en) * 2021-06-26 2021-08-31 西北农林科技大学 Night tomato identification system and method based on improved yolo
CN115424247A (en) * 2022-06-24 2022-12-02 中国农业科学院农业信息研究所 Greenhouse tomato identification and detection method adopting CBAM and octave convolution to improve YOLOV5
CN115424247B (en) * 2022-06-24 2023-04-25 中国农业科学院农业信息研究所 Greenhouse tomato identification and detection method adopting CBAM and octave convolution to improve YOLOV5
CN116071423A (en) * 2023-02-09 2023-05-05 哈尔滨市科佳通用机电股份有限公司 Method, system and medium for positioning railway wagon brake adjuster component

Also Published As

Publication number Publication date
CN112232263B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN112232263B (en) Tomato identification method based on deep learning
CN108345869B (en) Driver posture recognition method based on depth image and virtual data
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN108918536B (en) Tire mold surface character defect detection method, device, equipment and storage medium
CN108830188B (en) Vehicle detection method based on deep learning
CN107134144B (en) A kind of vehicle checking method for traffic monitoring
CN108960245B (en) Tire mold character detection and recognition method, device, equipment and storage medium
CN112446388A (en) Multi-category vegetable seedling identification method and system based on lightweight two-stage detection model
CN109800628A (en) A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN103049763B (en) Context-constraint-based target identification method
CN111275082A (en) Indoor object target detection method based on improved end-to-end neural network
CN110853015A (en) Aluminum profile defect detection method based on improved Faster-RCNN
CN104598885B (en) The detection of word label and localization method in street view image
CN110222767B (en) Three-dimensional point cloud classification method based on nested neural network and grid map
CN110163069B (en) Lane line detection method for driving assistance
CN103093240A (en) Calligraphy character identifying method
CN111563452A (en) Multi-human body posture detection and state discrimination method based on example segmentation
CN110032952B (en) Road boundary point detection method based on deep learning
CN112241762A (en) Fine-grained identification method for pest and disease damage image classification
CN111798447B (en) Deep learning plasticized material defect detection method based on fast RCNN
CN111178177A (en) Cucumber disease identification method based on convolutional neural network
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
CN110599463A (en) Tongue image detection and positioning algorithm based on lightweight cascade neural network
CN106599810A (en) Head pose estimation method based on stacked auto-encoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant