NL2025689B1 - Crop pest detection method based on F-SSD-IV3

Crop pest detection method based on F-SSD-IV3

Info

Publication number
NL2025689B1
NL2025689B1 (application NL2025689A)
Authority
NL
Netherlands
Prior art keywords
ssd
layer
candidate
feature
crop pest
Prior art date
Application number
NL2025689A
Other languages
Dutch (nl)
Other versions
NL2025689A (en)
Inventor
He Yong
Zeng Hong
Wu Jianjian
Xu Jian
Original Assignee
Univ Zhejiang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Zhejiang filed Critical Univ Zhejiang
Publication of NL2025689A publication Critical patent/NL2025689A/en
Application granted granted Critical
Publication of NL2025689B1 publication Critical patent/NL2025689B1/en

Classifications

    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/188 Vegetation
    • G06F16/51 Indexing; Data structures therefor; Storage structures (still image data)
    • G06F16/55 Clustering; Classification (still image data)
    • G06F16/5866 Retrieval characterised by using metadata generated manually, e.g. tags, keywords, comments, location and time information
    • G06F18/2413 Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks


Abstract

The present invention discloses a crop pest detection method based on Feature Fusion Single Shot Multibox Detector Inception V3 (F-SSD-IV3), including the following steps: (1) capturing pest images to construct a crop pest database; (2) constructing an F-SSD-IV3 target detection algorithm, using Inception V3 to replace VGG-16 as the feature extractor, designing a feature fusion method that fuses context information into output feature maps of different scales, and finally fine-tuning candidate bounds by using Softer NMS; and (3) optimizing the network during training, and improving detection performance and the model generalization capability by amplifying data and adding a Dropout layer.

Description

P3457ONLOO/TRE Title: CROP PEST DETECTION METHOD BASED ON F-SSD-IV3
TECHNICAL FIELD The present invention belongs to the field of deep learning and computer vision, and in particular, to a crop pest detection method based on Feature Fusion Single Shot Multibox Detector Inception V3 (F-SSD-IV3).
BACKGROUND With the continuous growth of the global population, the demand for grain is also increasing dramatically. Due to the natural environment and the characteristics of the crops themselves, crops are inevitably attacked by pests at different growth stages. If the pests cannot be detected and eliminated in time, an outbreak of pests may occur. A large-scale outbreak of pests will affect the healthy growth of crops, thereby greatly reducing the yield and quality of the crops.
Conventional pest identification is based on morphological features such as shape, color, and texture, and relies on manual identification; as a result, it suffers from subjectivity, poor timeliness, and high labor intensity. Early pest identification was based on template matching technology and simple models, extracting features from pest images by using artificially designed descriptors. Common features include the histogram of oriented gradients (HOG), the local binary pattern (LBP), scale-invariant feature transform (SIFT), Haar-like features, and the deformable parts model (DPM). However, an artificially designed feature depends on prior knowledge; it is therefore difficult to accurately express the color and morphology of a target pest, and such features lack robustness. In addition, the application scenarios of the above-mentioned methods are limited, being suitable only for an ideal laboratory environment.
In recent years, relying on the powerful feature expression capability of the convolutional neural network (CNN), target detection methods based on deep learning have made great breakthroughs in detection performance. In general, they can be divided into two types: target detection based on candidate regions and target detection based on regression. In a candidate-region method, the algorithm generates candidate regions in an image, extracts features from each candidate region to generate a region of interest (RoI), and finally conducts classification and regression. Common algorithms include R-CNN [53], Fast R-CNN [54], Faster R-CNN [55], and R-FCN [56]. Such methods have relatively high accuracy but a low detection speed. Currently, the main trend of object detection is faster and more efficient detection. Regression-based target detection methods such as YOLO [59] and SSD [60] have the obvious advantage of a high detection speed: for an input image, bounding boxes and their categories are predicted at multiple positions of the image at the same time, without a candidate-region stage. A limitation of YOLO lies in its strong spatial constraint on the prediction of bounding boxes, which makes it difficult to detect small target objects at multiple scales. In terms of detection speed, SSD can basically achieve real-time performance, but its detection performance on small target objects is relatively poor. In an actual field environment, the background is complex, pest types and postures are diverse, and the target size in an obtained pest image is relatively small. Consequently, existing detection methods cannot well satisfy the needs of the crop pest detection field.
SUMMARY To resolve the problem that existing detection methods cannot well balance detection speed against detection accuracy, and based on the characteristics of available pest images (a small number of samples, small target objects, diverse posture changes, and susceptibility to occlusion), the present invention proposes a new F-SSD-IV3 target detection method for crop pest detection that improves the SSD target detection algorithm.
To achieve the foregoing objective, the present invention provides the following technical solution, including the following steps, as shown in FIG. 1: (1) Capture pest images through internet downloading, smartphone shooting, digital camera shooting, etc. to construct a crop pest database.
(1-1) Convert all RGB pest images to the JPEG format, and name the images with pest names and consecutive numbers.
(1-2) Label the category of each pest and a rectangular bounding box in the image by using the image annotation tool LabelImg, where the rectangular bounding box is defined by four pieces of coordinate information: xmin, ymin, xmax, and ymax.
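As an illustration, the following is a minimal sketch of reading one such annotation back into Python. LabelImg writes Pascal VOC-style XML by default; the file name and the example output in the comment are hypothetical, not taken from the patent.

```python
import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Parse a LabelImg (Pascal VOC) XML file into (category, box) pairs."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        name = obj.find("name").text          # pest category label
        box = obj.find("bndbox")
        coords = tuple(int(float(box.find(k).text))
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, coords))
    return objects

# e.g. read_voc_annotation("aphid_0001.xml") -> [("aphid", (34, 50, 120, 160))]
```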
(2) Construct an F-SSD-IV3 target detection algorithm, use Inception V3 to replace VGG-16 as the feature extractor, design a feature fusion method that fuses context information into the output feature maps of different scales, and finally fine-tune candidate bounds by using Softer NMS, where the method is shown in FIG. 2, and the detailed process includes the following: (2-1) Select Inception V3 as the basic network of the F-SSD-IV3, where the structure of the Inception V3 network is shown in FIG. 3 and includes a convolutional layer, a convolutional layer, a convolutional layer, a pooling layer, a convolutional layer, a convolutional layer, a pooling layer, Mixed1_a, Mixed1_b, Mixed1_c, Mixed2_a, Mixed2_b, Mixed2_c, Mixed2_d, Mixed2_e, Mixed3_a, Mixed3_b, Mixed3_c, a pooling layer, a dropout layer, and a fully connected layer; the size of an input image is 300x300x3; dimensions of the convolution kernels include 1x1, 1x3, 3x1, 3x3, 5x5, 1x7, and 7x1; the pooling layers include maximum pooling and average pooling, with a dimension of 3x3; and the sizes of the obtained feature maps are 149x149x32, 147x147x32, 147x147x64, 73x73x64, 73x73x80, 71x71x192, 35x35x192, 35x35x256, 35x35x288, 35x35x288, 17x17x768, 17x17x768, 17x17x768, 17x17x768, 17x17x768, 8x8x1280, 8x8x2048, 8x8x2048, and 1x1x2048.
(2-2) Then add an additional network of six convolutional layers after the Inception V3, where the sizes of the convolution kernels are respectively 1x1x256, 3x3x512 (stride 2), 1x1x128, 3x3x256 (stride 2), 1x1x256, and 3x3x128 (stride 1); and obtain three feature maps with gradually decreasing sizes, respectively 4x4x512, 2x2x256, and 1x1x128.
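A rough Keras sketch of steps (2-1) and (2-2) follows. It is a sketch under assumptions, not the patent's implementation: tf.keras.applications.InceptionV3 exposes its inception blocks under names such as mixed2/mixed7/mixed10 rather than the Mixed1_c/Mixed2_e/Mixed3_c names used above, so that mapping, and the padding choices, are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_feature_extractor(input_shape=(300, 300, 3)):
    """Backbone plus the additional convolutional network of step (2-2)."""
    # InceptionV3 without its classification head, pre-trained on ImageNet.
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Assumed mapping of the patent's Mixed1_c / Mixed2_e / Mixed3_c taps
    # onto the Keras layer names mixed2 / mixed7 / mixed10.
    taps = [backbone.get_layer(name).output
            for name in ("mixed2", "mixed7", "mixed10")]

    # Six extra convolutional layers: (filters, kernel, stride) per the text.
    x = taps[-1]
    extra_cfg = [(256, 1, 1), (512, 3, 2), (128, 1, 1),
                 (256, 3, 2), (256, 1, 1), (128, 3, 1)]
    for filters, ksize, stride in extra_cfg:
        x = layers.Conv2D(filters, ksize, strides=stride, padding="same",
                          activation="relu")(x)
        if ksize == 3:  # each 3x3 layer yields one of the extra feature maps
            taps.append(x)
    # Note: padding is simplified; reproducing the exact 4x4/2x2/1x1 map
    # sizes from the text would need per-layer padding choices.
    return tf.keras.Model(backbone.input, taps)
```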
(2-3) Conduct feature fusion on the feature maps output in step (2-2), the Mixed1_c feature map, the Mixed2_e feature map, and the Mixed3_c feature map, to resolve the problem that small target objects are difficult to detect in the later stage of the original SSD target detection method due to a serious lack of global context information. The feature fusion method is shown in FIG. 4 and specifically includes first conducting deconvolution on the feature map at the next layer, then fusing the feature map at the next layer with the feature map at the current layer in a cascading manner, and outputting a new feature map. The output candidate bounds in the network structure can be represented by the following formula:

$$\text{Output candidate bounds} = \{P_{n-1}(f'_{n-1}), \ldots, P_k(f'_k)\}, \quad f'_{n-1} = f_{n-1} + f_n, \quad f'_k = f_k + f_{k+1} + \cdots + f_n, \quad n > k \geq 0$$

where $f_n$ represents the feature map output at the cascaded $n$-th layer, and $P$ represents the candidate bounds generated for each feature map.
"+" in FIG. 4 represents a cascading module formed by a deconvolution layer, 3x3 convolution layers, and a 1x1 convolution layer, and can transfer an advanced feature to a lower layer. To combine feature maps of different sizes, the cascading module uses the deconvolution layer to generate and input feature maps with a same height and width; then uses two 3x3 convolution layers to better learn features; and uses a standardized layer before connection to conduct normalization processing on the input feature maps. Normalization can resolve a problem of gradient explosion, and can greatly increase a training speed during network training. Concat can combine two feature maps. Other dimensions of the two feature maps are same except a stitching dimension. The 1x1 convolutional layer is introduced for dimensionality reduction and feature recombination.
(2-4) Conduct convolution on the k candidate bounds at each position in an m×n feature map, where the size of the convolution kernel is (c+4)k, to predict c category scores and four position offsets, finally generating m×n×k(c+4) predicted outputs. For the candidate bounds of the feature maps, the original SSD uses a minimum scale $S_{min} = 0.2$ and a maximum scale $S_{max} = 0.9$; in the present invention, $S_{min} = 0.1$ and $S_{max} = 0.95$, so the size range of the candidate bounds is larger. To ensure a smooth scale transition between layers, a new scale $S'_k = (S_k + S_{k+1})/2$ is added for the feature map at each layer, so as to improve the detection accuracy. In addition, the default aspect ratios of a candidate bound are set to $a_r \in \{1, 2, 3, 1/2, 1/3\}$. When $a_r = 1$, an extra candidate bound is added, with size $S'_k = \sqrt{S_k S_{k+1}}$.

(2-5) During detection with the original SSD algorithm, the NMS preserves candidate bounds with relatively high confidence coefficients, and a large number of overlapping candidate bounds are generated (24,564 candidate bounds are generated by SSD512). The candidate bounds are therefore refined as follows: (1) a candidate bound M is selected by using the Softer NMS; (2) for each selected candidate bound M, whether the IoU of another candidate bound with M is greater than a threshold p is determined; (3) weighted averaging is conducted on all candidate bounds whose IoUs are greater than the threshold p, and the position coordinates of the candidate bounds are updated (a simplified sketch of this refinement is given after step (3-1) below).

(2-6) The loss function of the SSD is formed by two parts, a position loss $L_{loc}$ and a classification loss $L_{conf}$, and can be represented as follows:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N represents the number of candidate bounds matching a real boundary, c is the confidence coefficient of each type of candidate bound, l is the value of the translation and scale change of a candidate bound, g is the position information of the real boundary, and α = 1 by default.

(3) Optimize the network during training, and improve detection performance and the model generalization capability by amplifying data and adding a Dropout layer.

(3-1) The data set of pests is relatively small, new data are relatively difficult to obtain, and relatively high costs are required to obtain a sufficiently labeled data set. Therefore, a data amplification method is adopted in the present invention to expand the data set. Data amplification can be represented as the following formula:

$$\phi : S \to T$$

where S represents the raw training data, T represents the data obtained after data amplification, and φ is the adopted data amplification method.
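Returning to step (2-5), the following numpy sketch shows the box-refinement loop: boxes whose IoU with the selected box M exceeds the threshold p are merged by confidence-weighted averaging. This is a simplified rendering; full Softer-NMS weights boxes by their predicted localization variance, which the text above does not detail.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, as [xmin, ymin, xmax, ymax]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def refine_boxes(boxes, scores, p=0.6):
    """Greedy selection plus confidence-weighted coordinate averaging."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    order = np.argsort(scores)[::-1]          # highest confidence first
    while order.size:
        m = order[0]
        overlaps = iou(boxes[m], boxes[order])
        group = order[overlaps > p]           # includes M itself (IoU = 1)
        w = scores[group][:, None]            # confidence weights
        boxes[m] = (w * boxes[group]).sum(0) / w.sum()  # update M's coords
        keep.append(m)
        order = order[overlaps <= p]          # drop M and the merged boxes
    return boxes[keep], scores[keep]
```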
In the present invention, common data amplification manners are adopted: the luminance, contrast, and saturation of an image are randomly adjusted, and flipping, rotation, cropping, and translation are conducted on the image. Finally, the training set is expanded fivefold.
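A sketch of such a pipeline with tf.image; the adjustment ranges are assumptions, and rotation/translation (not provided by tf.image directly) are left to, e.g., Keras preprocessing layers:

```python
import tensorflow as tf

def augment(image):
    """Randomly perturb one image; the ranges below are assumptions."""
    image = tf.image.random_brightness(image, max_delta=0.2)   # luminance
    image = tf.image.random_contrast(image, 0.8, 1.2)          # contrast
    image = tf.image.random_saturation(image, 0.8, 1.2)        # saturation
    image = tf.image.random_flip_left_right(image)             # flipping
    image = tf.image.random_crop(image, size=(280, 280, 3))    # cropping
    # For detection data, the labeled boxes must be transformed consistently
    # with the geometric operations; that bookkeeping is omitted here.
    return tf.image.resize(image, (300, 300))

# Fivefold expansion: emit five independently augmented variants per image.
# ds = ds.flat_map(lambda im: tf.data.Dataset.from_tensors(im).repeat(5)).map(augment)
```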
(3-2) The Dropout policy can prevent model overfitting. During network training, some neurons at a hidden layer are randomly suppressed with a probability p in each iteration, and finally a comprehensive averaging policy is used to combine the different thinned networks into a final output model. In the present invention, the tested probabilities of randomly suppressing neurons at the hidden layer are p = 0.5, 0.6, 0.7, 0.8, and 0.9.
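In Keras terms, the suppression probability p maps directly onto the Dropout layer's rate argument; the dense layer below is purely illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Dropout's `rate` is exactly the suppression probability p: each hidden
# unit is zeroed with probability p on every training iteration.
def head_with_dropout(p=0.8):  # p = 0.8 gave the best mAP in the experiments
    return tf.keras.Sequential([
        layers.Dense(1024, activation="relu"),  # hypothetical hidden layer
        layers.Dropout(rate=p),
    ])
```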
BRIEF DESCRIPTION OF DRAWINGS FIG. 1 is a step diagram of a detection method according to the present invention; FIG. 2 is a flowchart of an F-SSD-IV3 algorithm; FIG. 3 is a network structure diagram of Inception V3; and FIG. 4 is a schematic diagram of a feature fusion method.
DETAILED DESCRIPTION The present invention is described in detail below with reference to embodiments and the accompanying drawings, but the present invention is not limited thereto.
(1) Experimental data: In the present invention, a field crop typical-pest data set collected by the Institute of Agricultural Information Technology, Zhejiang University is adopted; the pest images in the data set vary in image size, light conditions, blocking degree, shooting angle, and target pest size. Images in the database are randomly and evenly distributed into a training set, a validation set, and a test set at a ratio of 7:2:1. The model is trained on the data in the training set, the validation set is used for evaluation to select model parameters, and finally model performance and efficiency are measured on the test set.
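A sketch of the random 7:2:1 split; the function name and seed are illustrative:

```python
import random

def split_dataset(image_paths, seed=42):
    """Randomly split file paths into train/validation/test at 7:2:1."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(0.7 * len(paths))
    n_val = int(0.2 * len(paths))
    return (paths[:n_train],                      # training set
            paths[n_train:n_train + n_val],       # validation set
            paths[n_train + n_val:])              # test set
```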
(2) Experimental environment: The specifications of the experimental workstation are as follows: the memory is 32 GB, the operating system is Linux Ubuntu 18.04, and the CPU is an Intel Core i7-7800X. TensorFlow supports multi-GPU training, and two NVIDIA GeForce GTX 1080 Ti graphics cards are used for training in the present invention. Python is used as the programming language because it supports the TensorFlow deep learning framework.
(3) Training process: First, data amplification is conducted to expand the training set, and the size of an input image is fixed at 300x300x3. Then the network is initialized, the errors of the position loss function and the classification loss function are calculated through forward propagation, and the parameters are updated through backpropagation until 200,000 iterations are completed; finally, the parameters are saved. In the experiment, an Inception V3 model pre-trained on ImageNet is used as the feature extraction network of the SSD through fine-tuning, and the parameters of the Inception V3 initialize the parameters of the basic network to increase the training speed. The training hyperparameters are as follows: weights are initialized with random numbers from a standard normal distribution with a standard deviation of 0.1 and a mean of 0. A stochastic gradient descent (SGD) method with Momentum is used; the momentum is 0.9, and the attenuation coefficient is also set to 0.9. Compared with plain SGD, the Momentum optimizer alleviates two problems: noise introduction and relatively large convergence oscillation. The initial learning rate is set to 0.004, the exponential attenuation parameter is set to 0.95, and the batch size is set to 24. A total of 200,000 iterations are conducted, and one complete training run takes approximately 20 hours. During training, when the IoU of a candidate bound and a labeled rectangular box exceeds 0.6, the candidate bound is a positive sample; otherwise, it is a negative sample.
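The hyperparameters above translate roughly into the following TensorFlow 2 sketch; decay_steps is an assumption, since the text gives only the decay factor 0.95:

```python
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.004,   # initial learning rate from the text
    decay_steps=10_000,            # assumed; the text gives only the factor
    decay_rate=0.95)               # exponential attenuation parameter
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

BATCH_SIZE = 24
TOTAL_ITERATIONS = 200_000
IOU_POSITIVE_THRESHOLD = 0.6  # candidate bound is positive above this IoU
```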
(4) The parameters of the model are continually adjusted according to the results on the validation set, and the test set is applied to the trained optimal model to determine the performance of the model. When p of the Dropout layer is 0.8, the mAP value is the highest. The F-SSD-IV3 algorithm proposed in the present invention is compared with the original SSD300, Faster R-CNN, and R-FCN target detection algorithms on the same test set, and the standard target detection performance evaluation indicator mAP proposed in the Pascal VOC Challenge is used as the performance indicator.
[Table 1 Performance comparison of various algorithms: detection accuracy and per-image detection time for SSD300, Faster R-CNN, R-FCN, and F-SSD-IV3; the numeric cells did not survive extraction except as quoted below]

It can be learned from the foregoing table that SSD300 has the best detection speed, namely 0.048 seconds per image, but the lowest detection accuracy; the detection accuracy of both Faster R-CNN and R-FCN is lower than 0.68, and each detects a single image in approximately 0.15 seconds. Compared with R-FCN and Faster R-CNN, F-SSD-IV3 has relatively large advantages in both detection accuracy and detection speed. Therefore, the F-SSD-IV3 proposed in the present invention better balances detection accuracy and detection speed, and has relatively high practical value for real-time and accurate detection of pests in a field environment.
The foregoing descriptions are merely preferred examples of the present invention, but are not intended to limit the present invention. Any modifications, equivalent replacements, or improvements made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

CLAIMS

1. A crop pest detection method based on Feature Fusion Single Shot Multibox Detector Inception V3 (F-SSD-IV3), comprising the following steps: (1) capturing images of pests to construct a crop pest database; (2) constructing an F-SSD-IV3 target detection algorithm: outputting feature maps of different scales from the images in the crop pest database by using Inception V3 as a feature extractor, conducting feature fusion on the feature maps, and fine-tuning candidate bounds by using Softer NMS; and (3) optimizing the target detection network by amplifying data and adding a Dropout layer, to obtain an optimal detection model that is used to detect crop pests in an image.

2. The crop pest detection method based on F-SSD-IV3 according to claim 1, wherein the crop pest database stores pest images of different image sizes, light conditions, blocking degrees, shooting angles, and target pest sizes.

3. The crop pest detection method based on F-SSD-IV3 according to claim 1, wherein step (2) specifically comprises the following steps:
(2-1) selecting Inception V3 as a basic network of the F-SSD-IV3, wherein a structure of the Inception V3 network comprises a convolutional layer, a convolutional layer, a convolutional layer, a pooling layer, a convolutional layer, a convolutional layer, a pooling layer, Mixed1_a, Mixed1_b, Mixed1_c, Mixed2_a, Mixed2_b, Mixed2_c, Mixed2_d, Mixed2_e, Mixed3_a, Mixed3_b, Mixed3_c, a pooling layer, a dropout layer, and a fully connected layer; dimensions of the convolution kernels include 1x1, 1x3, 3x1, 3x3, 5x5, 1x7, and 7x1; the pooling layers include maximum pooling and average pooling and have a dimension of 3x3; and sizes of the obtained feature maps are 149x149x32, 147x147x32, 147x147x64, 73x73x64, 73x73x80, 71x71x192, 35x35x192, 35x35x256, 35x35x288, 35x35x288, 17x17x768, 17x17x768, 17x17x768, 17x17x768, 17x17x768, 8x8x1280, 8x8x2048, 8x8x2048, and 1x1x2048;
(2-2) then adding an additional network of six convolutional layers after the Inception V3, wherein sizes of the convolution kernels are respectively 1x1x256, 3x3x512, 1x1x128, 3x3x256, 1x1x256, and 3x3x128, and obtaining three feature maps with gradually decreasing sizes, respectively 4x4x512, 2x2x256, and 1x1x128;
(2-3) conducting feature fusion on the feature maps output in step (2-2), a Mixed1_c feature map, a Mixed2_e feature map, and a Mixed3_c feature map, and outputting a new feature map;
(2-4) conducting convolution on k candidate bounds at each position in an m×n feature map, wherein the size of the convolution kernel is (c+4)k, predicting c category scores and four position changes, and finally generating m×n×k(c+4) predicted outputs;
(2-5) using the NMS to preserve a candidate bound with a relatively high confidence coefficient, and generating a large number of candidate bounds between which an overlap exists; selecting a candidate bound by using the Softer NMS for each candidate bound; determining, for each selected candidate bound M, whether the IoU of another candidate bound and the candidate bound M is greater than a threshold p; and conducting weighted averaging on all candidate bounds whose IoUs are greater than the threshold p, and updating the position coordinates of the candidate bounds; and
(2-6) wherein a loss function of the SSD is formed by two parts, a position loss $L_{loc}$ and a classification loss $L_{conf}$, and can be represented as follows:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

wherein N represents the number of candidate bounds matching a real boundary, c is the confidence coefficient of each type of candidate bound, l is the value of a translation and scale change of a candidate bound, g is the position information of the real boundary, and α = 1 by default.

4. The crop pest detection method based on F-SSD-IV3 according to claim 3, wherein in step (2-3) the feature fusion method comprises first conducting deconvolution on a feature map at a next layer, then conducting feature fusion on the feature map at the next layer and a feature map at a current layer in a cascading manner, and outputting a new feature map.

5. The crop pest detection method based on F-SSD-IV3 according to claim 3, wherein the output candidate bounds in the network structure can be represented by the following formula:

$$\text{Output candidate bounds} = \{P_{n-1}(f'_{n-1}), \ldots, P_k(f'_k)\}, \quad f'_{n-1} = f_{n-1} + f_n, \quad f'_k = f_k + f_{k+1} + \cdots + f_n, \quad n > k \geq 0$$

wherein $f_n$ represents the feature map output at the cascaded $n$-th layer, and $P$ represents the candidate bounds generated for each feature map.

6. The crop pest detection method based on F-SSD-IV3 according to claim 5, wherein the default aspect ratios of a candidate bound are set to $a_r \in \{1, 2, 3, 1/2, 1/3\}$, and when $a_r = 1$, an extra candidate bound is added, whose size is $S'_k = \sqrt{S_k S_{k+1}}$.

7. The crop pest detection method based on F-SSD-IV3 according to claim 1, wherein data amplification in step (3) is represented as the following formula:

$$\phi : S \to T$$

wherein S represents the raw training data, T represents the data obtained after data amplification, and φ is the adopted data amplification method; and the luminance, contrast, and saturation of an image are randomly adjusted, and flipping, rotation, cropping, and translation are conducted on the image.

8. The crop pest detection method based on F-SSD-IV3 according to claim 1, wherein the Dropout policy is as follows: during network training, some neurons at a hidden layer are randomly suppressed with a probability p in each iteration, and finally a comprehensive averaging policy is used to combine different neural networks into a final output model.
NL2025689A 2019-05-31 2020-05-27 Crop pest detection method based on f-ssd-iv3 NL2025689B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910470899.6A CN110222215B (en) 2019-05-31 2019-05-31 Crop pest detection method based on F-SSD-IV3

Publications (2)

Publication Number Publication Date
NL2025689A NL2025689A (en) 2020-12-03
NL2025689B1 true NL2025689B1 (en) 2021-06-07

Family

ID=67819271

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2025689A NL2025689B1 (en) 2019-05-31 2020-05-27 Crop pest detection method based on f-ssd-iv3

Country Status (2)

Country Link
CN (1) CN110222215B (en)
NL (1) NL2025689B1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782435A (en) * 2019-10-17 2020-02-11 浙江中烟工业有限责任公司 Tobacco worm detection method based on deep learning model
CN112464971A (en) * 2020-04-09 2021-03-09 丰疆智能软件科技(南京)有限公司 Method for constructing pest detection model
CN111476317B (en) * 2020-04-29 2023-03-24 中国科学院合肥物质科学研究院 Plant protection image non-dense pest detection method based on reinforcement learning technology
CN111476238B (en) * 2020-04-29 2023-04-07 中国科学院合肥物质科学研究院 Pest image detection method based on regional scale perception technology
CN111882002B (en) * 2020-08-06 2022-05-24 桂林电子科技大学 MSF-AM-based low-illumination target detection method
CN113065473A (en) * 2021-04-07 2021-07-02 浙江天铂云科光电股份有限公司 Mask face detection and body temperature measurement method suitable for embedded system
CN115641575A (en) * 2022-10-24 2023-01-24 南京睿升达科技有限公司 Leafhopper agricultural pest detection method based on sparse candidate frame
CN116070789B (en) * 2023-03-17 2023-06-02 北京茗禾科技有限公司 Artificial intelligence-based single-yield prediction method for mature-period rice and wheat

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7496228B2 (en) * 2003-06-13 2009-02-24 Landwehr Val R Method and system for detecting and classifying objects in images, such as insects and other arthropods
US7286056B2 (en) * 2005-03-22 2007-10-23 Lawrence Kates System and method for pest detection
CN107665355B (en) * 2017-09-27 2020-09-29 重庆邮电大学 Agricultural pest detection method based on regional convolutional neural network
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN109002755B (en) * 2018-06-04 2020-09-01 西北大学 Age estimation model construction method and estimation method based on face image
CN109101994B (en) * 2018-07-05 2021-08-20 北京致远慧图科技有限公司 Fundus image screening method and device, electronic equipment and storage medium
CN109191455A (en) * 2018-09-18 2019-01-11 西京学院 A kind of field crop pest and disease disasters detection method based on SSD convolutional network
CN109740463A (en) * 2018-12-21 2019-05-10 沈阳建筑大学 A kind of object detection method under vehicle environment

Also Published As

Publication number Publication date
CN110222215B (en) 2021-05-04
NL2025689A (en) 2020-12-03
CN110222215A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
NL2025689B1 (en) Crop pest detection method based on f-ssd-iv3
Saedi et al. A deep neural network approach towards real-time on-branch fruit recognition for precision horticulture
WO2020177432A1 (en) Multi-tag object detection method and system based on target detection network, and apparatuses
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
Chen et al. Weed detection in sesame fields using a YOLO model with an enhanced attention mechanism and feature fusion
Mathur et al. Crosspooled FishNet: transfer learning based fish species classification model
CN111652317B (en) Super-parameter image segmentation method based on Bayes deep learning
CN109033978B (en) Error correction strategy-based CNN-SVM hybrid model gesture recognition method
Su et al. LodgeNet: Improved rice lodging recognition using semantic segmentation of UAV high-resolution remote sensing images
CN115620160A (en) Remote sensing image classification method based on multi-classifier active transfer learning resistance
Hao et al. Growing period classification of Gynura bicolor DC using GL-CNN
CN112364747B (en) Target detection method under limited sample
CN112598031A (en) Vegetable disease detection method and system
Wenxia et al. Identification of maize leaf diseases using improved convolutional neural network.
Ouf Leguminous seeds detection based on convolutional neural networks: Comparison of faster R-CNN and YOLOv4 on a small custom dataset
Singh et al. Performance Analysis of CNN Models with Data Augmentation in Rice Diseases
Song et al. Multi-source remote sensing image classification based on two-channel densely connected convolutional networks.
Sharma et al. Deep Learning Meets Agriculture: A Faster RCNN Based Approach to pepper leaf blight disease Detection and Multi-Classification
Tu et al. Toward automatic plant phenotyping: starting from leaf counting
Sadati et al. An improved image classification based in feature extraction from convolutional neural network: application to flower classification
CN115393631A (en) Hyperspectral image classification method based on Bayesian layer graph convolution neural network
Yang et al. Intelligent collection of rice disease images based on convolutional neural network and feature matching
CN109308936B (en) Grain crop production area identification method, grain crop production area identification device and terminal identification equipment
Chu et al. Automatic image annotation combining svms and knn algorithm
Zhang et al. Unsound wheat kernel recognition based on deep convolutional neural network transfer learning and feature fusion