CN112149518A - Pine cone detection method based on BEGAN and YOLOV3 models - Google Patents

Pine cone detection method based on BEGAN and YOLOV3 models

Info

Publication number
CN112149518A
Authority
CN
China
Prior art keywords
yolov3
yolov3 model
model
pine cone
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010912858.0A
Other languages
Chinese (zh)
Inventor
张怡卓
于慧伶
蒋大鹏
张健
罗泽
葛奕麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Shengdong Technology Development Co ltd
Original Assignee
Jiangsu Shengdong Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Shengdong Technology Development Co ltd filed Critical Jiangsu Shengdong Technology Development Co ltd
Priority to CN202010912858.0A priority Critical patent/CN112149518A/en
Publication of CN112149518A publication Critical patent/CN112149518A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pine cone detection method based on the BEGAN and YOLOV3 models. The method first collects pine cone images at a plurality of different time nodes and performs data enhancement with both traditional image enhancement techniques and the BEGAN deep learning method; the data-enhanced pine cone images are divided into a plurality of cells by a YOLOV3 model, and a densely connected network structure, modified with a constructed bottleneck-structure dense layer, is introduced into the YOLOV3 model; the detection scales of the YOLOV3 model are expanded, and the loss function of the YOLOV3 model is optimized with the DIoU algorithm, so that the overall performance of pine cone detection can be effectively improved.

Description

Pine cone detection method based on BEGAN and YOLOV3 models
Technical Field
The invention relates to the technical field of pine cone recognition, in particular to a pine cone detection method based on BEGAN and YOLOV3 models.
Background
Real-time detection of pine cones in Korean pine forests is not only the data basis for mechanized pine cone picking, but also one of the important methods for estimating the yield of a Korean pine forest. In recent years, deep learning methods applied to images of fruit on trees have reached a certain detection accuracy, but these methods suffer from limited reference detection data, slow speed and low detection accuracy, which results in poor overall performance of pine cone detection.
Disclosure of Invention
The invention aims to provide a pine cone detection method based on BEGAN and YOLOV3 models, which improves the overall performance of pine cone detection.
In order to achieve the above object, the present invention provides a pine cone detection method based on the BEGAN and YOLOV3 models, comprising:
collecting pine cone images under a plurality of different time nodes, and performing data enhancement by using a traditional image enhancement technology and a BEGAN deep learning method;
dividing the pine cone image after data enhancement by using a YOLOV3 model, and introducing a dense connection network structure in the YOLOV3 model;
expanding the detection proportion of the Yolov3 model, and optimizing a loss function of the Yolov3 model by using a DIoU algorithm.
Dividing the pine cone image after data enhancement by using a Yolov3 model, and introducing a dense connection network structure in the Yolov3 model, wherein the method comprises the following steps:
dividing the input pine cone image into a plurality of cells by using a YOLOV3 model, and acquiring a plurality of bounding box information and 5 data values corresponding to the bounding box information by taking the cells with pine cones as units.
The method includes the steps of dividing the pine cone image after data enhancement by using a YOLOV3 model, and introducing a dense connection network structure into the YOLOV3 model, and further includes the following steps:
and dividing 23 residual modules in a backbone network used by the YOLOV3 model into 5 groups, modifying a dense connection network structure by using the constructed bottleneck structure dense layer, and replacing any three groups of residual modules in the 5 groups of residual modules.
The method comprises the steps of expanding the detection proportion of the YOLOV3 model, and optimizing a loss function of the YOLOV3 model by using a DIoU algorithm, wherein the method comprises the following steps:
and respectively carrying out up-sampling on 32-time down-sampling, 16-time down-sampling and 8-time down-sampling, and then connecting the up-sampling with the output of the second group of residual error modules in the YOLOV3 model after the dense connection network structure is introduced to obtain a feature fusion target detection layer with 4-time down-sampling.
Wherein, expanding the detection proportion of the YOLOV3 model, and optimizing the loss function of the YOLOV3 model by using a DIoU algorithm, further comprises:
and constructing the coordinate error, the confidence error and the classification error into a loss function of the YOLOV3 model, and optimizing the loss function according to the Euclidean distance and the diagonal distance of the central point between the corresponding prediction frame and the target frame.
In the pine cone detection method based on the BEGAN and YOLOV3 models of the present invention, pine cone images are first collected at a plurality of different time nodes and data enhancement is performed with both traditional image enhancement techniques and the BEGAN deep learning method; the data-enhanced pine cone images are divided into a plurality of cells by a YOLOV3 model, and a densely connected network structure, modified with a constructed bottleneck-structure dense layer, is introduced into the YOLOV3 model; the detection scales of the YOLOV3 model are expanded, and the loss function of the YOLOV3 model is optimized with the DIoU algorithm, so that the overall performance of pine cone detection can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic step diagram of a pine cone detection method based on the models of BEGAN and YOLOV3 according to the present invention.
Fig. 2 is a schematic structural diagram of a BEGAN network provided by the present invention.
Fig. 3 is a schematic structural diagram of the backbone network provided by the present invention after being divided.
Fig. 4 is a schematic structural diagram of a basic dense layer provided by the present invention.
Fig. 5 is a schematic diagram of a dense connection network structure after a bottleneck structure dense layer is added.
Fig. 6 is a diagram of an improved darknet-53 backbone network architecture provided by the present invention.
FIG. 7 is a schematic diagram of a method for improving a scale detection module provided by the present invention.
FIG. 8 is a schematic structural diagram of the entire detection method provided by the present invention.
FIG. 9 is a P-R curve obtained from recall and precision provided by the present invention.
Fig. 10 is a graph comparing the AP of YOLOV3 with the densely connected network introduced against that of the original YOLOV3.
FIG. 11 is a comparison of the expanded detection scale provided by the present invention.
FIG. 12 is a comparison graph of the loss function using the DIoU optimization provided by the present invention.
FIG. 13 is a graph of data enhancement contrast provided by the present invention.
FIG. 14 is a P-R plot of different data sets provided by the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Referring to fig. 1, the present invention provides a pine cone detection method based on the models of BEGAN and YOLOV3, including:
s101, collecting a plurality of pine cone images under different time nodes, and performing data enhancement by using a traditional image enhancement technology and a BEGAN deep learning method.
Specifically, a camera is used to acquire images at a resolution of 5312×2988 pixels, and the acquired images are manually annotated for the experiments. The acquisition site is located in a forest farm; image data are collected on both cloudy and sunny days, and the acquisition times include 8 a.m., 1 p.m. and 3 p.m. Some images are collected at different angles of the same position to account for recognition performance at different angles. The original data set contains 800 collected pine cone images. Data enhancement uses two methods: traditional image enhancement techniques and the BEGAN deep learning method.
The GAN network model consists of two networks, a generator G and a discriminator D. The generator G receives a random noise vector z and generates an image G(z); the function of the D network is to judge whether an image is real, its input is an image x, and its output D(x) represents the probability that x is a real image. Through adversarial training, the generator G and the discriminator D play a minimax game and finally reach a Nash equilibrium. In the optimal state, G can generate images G(z) that are realistic enough to pass for real, and D(G(z)) = 0.5; the trained GAN network can then realize data enhancement by generating images with the generator G.
In the original GAN, the data distribution produced by the generator is expected to be as close as possible to the real data distribution; when the generated distribution equals the real distribution, the generator G is considered to produce samples indistinguishable from real data, i.e. it has acquired the ability to generate convincing data. From this starting point, researchers have designed various loss functions to bring the generated distribution of G as close as possible to the real distribution. Instead of estimating probability distributions in this way, BEGAN directly measures the difference between the generated data distribution pg and the real data distribution px by computing the distance between the reconstruction errors of the two distributions. To estimate this error, BEGAN employs a discriminator with an autoencoder structure, whose network structure is shown in FIG. 2.
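As an illustration of this training scheme, the following is a minimal PyTorch sketch of one BEGAN update step; the generator and autoencoder-discriminator architectures, the optimizers and the λ_k value are illustrative assumptions and do not reproduce the exact network of FIG. 2.

```python
# Minimal BEGAN training-step sketch (illustrative architectures, not the patent's exact network).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a noise vector z to a 3x64x64 image."""
    def __init__(self, z_dim=64, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, ch * 8 * 8), nn.ELU(),
            nn.Unflatten(1, (ch, 8, 8)),
            nn.Upsample(scale_factor=2), nn.Conv2d(ch, ch, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(ch, ch, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, z):
        return self.net(z)

class AEDiscriminator(nn.Module):
    """BEGAN discriminator: an autoencoder whose reconstruction error scores images."""
    def __init__(self, ch=32, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ELU(),   # 64 -> 32
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ELU(),  # 32 -> 16
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ELU(),  # 16 -> 8
            nn.Flatten(), nn.Linear(ch * 8 * 8, h_dim),
        )
        self.dec = Generator(z_dim=h_dim, ch=ch)  # decoder mirrors the generator
    def forward(self, x):
        return self.dec(self.enc(x))

def recon_loss(d, x):
    # L1 reconstruction error of the autoencoder discriminator
    return (x - d(x)).abs().mean()

def began_step(g, d, opt_g, opt_d, real, z, k, gamma=0.4, lambda_k=0.001):
    """One BEGAN update; gamma trades generated-image diversity against quality (0.4 in this work)."""
    fake = g(z)
    loss_real = recon_loss(d, real)
    loss_fake = recon_loss(d, fake.detach())
    loss_d = loss_real - k * loss_fake          # discriminator objective
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    loss_g = recon_loss(d, fake)                # generator objective
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Proportional control of the equilibrium between the two reconstruction losses.
    k = float(min(max(k + lambda_k * (gamma * loss_real.item() - loss_fake.item()), 0.0), 1.0))
    m_global = loss_real.item() + abs(gamma * loss_real.item() - loss_fake.item())
    return k, m_global

# Hypothetical usage:
# g, d = Generator(), AEDiscriminator()
# opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)
# opt_d = torch.optim.Adam(d.parameters(), lr=1e-4)
# k = 0.0
# k, m = began_step(g, d, opt_g, opt_d, real_batch, torch.randn(len(real_batch), 64), k)
```

Here the variable k balances the two reconstruction losses, and the measure m_global can be monitored to judge convergence.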
The BEGAN network controls the quality and diversity of the generated images through a hyperparameter γ ranging from 0 to 1: as γ increases, the diversity of the generated images improves but the image quality decreases; here γ is set to 0.4. The pine cone images generated by the BEGAN network have a size of 64×64. All pine cone samples in the acquired original data images are extracted and uniformly resized to 64×64 to construct the training data set of the BEGAN network. Each batch of pine cone samples generated by the BEGAN network contains 64 different pine cone samples, and comparison with the pine cone samples of the original images shows that the images generated by the BEGAN network fully inherit the characteristics of real images. Since each generated image is a single pine cone, each picture is taken from the original data set in turn, the BEGAN-generated images are resized to replace the pine cone regions in the original picture, and the replaced image is placed back into the data set, as sketched below. In this manner, the present invention creates an augmented data set containing 1600 images.
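The patch-replacement step described above might look roughly like the following sketch; the annotation format (a list of pixel boxes per image), the directory layout and the value range of the generator output are assumptions rather than details given in the text.

```python
# Hypothetical sketch of the patch-replacement augmentation described above.
from pathlib import Path
import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image

def augment_with_began(image_path, boxes, generator, z_dim=64, out_dir="augmented"):
    """Replace every annotated pine cone region with a BEGAN-generated 64x64 patch."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        for (x1, y1, x2, y2) in boxes:                # assumed (x1, y1, x2, y2) pixel boxes
            z = torch.randn(1, z_dim)
            patch = generator(z)[0]                   # 3x64x64 tensor, assumed in [-1, 1]
            patch = to_pil_image((patch + 1) / 2)     # rescale to [0, 1] for PIL
            patch = patch.resize((x2 - x1, y2 - y1))  # match the annotated box size
            img.paste(patch, (x1, y1))
    out = Path(out_dir) / Path(image_path).name
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)                                     # the augmented copy joins the data set
    return out
```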
S102, dividing the pine cone image after data enhancement by using a YOLOV3 model, and introducing a dense connection network structure into the YOLOV3 model.
Specifically, YOLOV3 is a one-stage detection algorithm: it does not require a region proposal stage, but directly regresses bounding box coordinates and the probability of each category. During operation, the network divides the input picture into S × S cells and then produces output in units of cells. If the center of an object, i.e. a pine cone, falls in a cell, that cell is responsible for predicting the object. Each cell needs to predict B bounding boxes, and each bounding box contains 5 data values: x, y, w, h and confidence.
(x, y) is the displacement of the bounding box center point relative to the cell, and the final predicted (x, y) is normalized. Suppose the picture width is w_i and the height is h_i, the center coordinates of the bounding box are (x_c, y_c), and the cell coordinates are (x_col, y_row); then (x, y) is calculated as follows:

x = x_c · S / w_i − x_col

y = y_c · S / h_i − y_row
Here, (w, h) represents the ratio of the bounding box to the whole picture. Assuming the width and height of the predicted bounding box are (w_b, h_b), (w, h) is calculated as follows:

w = w_b / w_i

h = h_b / h_i
the confidence is composed of two parts, namely whether a target exists in the grid or not and the accuracy of the bounding box. The confidence is calculated as follows:
Figure BDA0002663932600000055
if the bounding box contains an object, then pr (object) is 1, otherwise pr (object) is 0;
Figure BDA0002663932600000056
for the area of intersection of the predicted bounding box and the real region of the object, this value is also at [0,1 ]]In the above paragraph.
In addition to the confidence, each cell also outputs C probability values indicating that the object belongs to each class, so the output dimension of the final network is S × S × (B × 5 + C).
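For illustration, a small sketch of this cell encoding is given below, following the formulas above; the tensor layout (B boxes followed by C class probabilities per cell) and the choice to let every box of the responsible cell carry the target are assumptions.

```python
# Sketch of the S x S x (B*5 + C) target encoding described above (layout assumed).
import torch

def encode_target(boxes, labels, img_w, img_h, S=13, B=3, C=1):
    """boxes: list of (x_c, y_c, w_b, h_b) in pixels; labels: class indices."""
    target = torch.zeros(S, S, B * 5 + C)
    for (x_c, y_c, w_b, h_b), cls in zip(boxes, labels):
        x_col = int(x_c * S / img_w)          # cell containing the box center
        y_row = int(y_c * S / img_h)
        x = x_c * S / img_w - x_col           # normalized offset inside the cell
        y = y_c * S / img_h - y_row
        w = w_b / img_w                       # size as a fraction of the picture
        h = h_b / img_h
        for b in range(B):                    # each of the B boxes of this cell predicts the object
            target[y_row, x_col, b * 5: b * 5 + 5] = torch.tensor([x, y, w, h, 1.0])
        target[y_row, x_col, B * 5 + cls] = 1.0   # class probability
    return target

# e.g. a single pine cone centered at (200, 150) with size 40x60 in a 416x416 image:
t = encode_target([(200, 150, 40, 60)], [0], 416, 416)
```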
The backbone network Darknet-53 used by YOLOV3 is composed of 23 residual modules in total, and each residual module consists of two convolution layers and a shortcut connection. The residual modules are divided into 5 groups containing 1, 2, 8, 8 and 4 residual modules respectively, as shown in FIG. 3. The densely connected network structure is modified with the constructed bottleneck-structure dense layer, and any three of the 5 groups of residual modules are replaced. A densely connected network can improve the flow of information and gradients throughout the network; its principle is as follows. Suppose the input is X_0 and each layer of the network implements a non-linear transformation H_i(·), where i denotes the i-th layer. Let the output of the i-th layer be denoted X_i; then:

X_i = H_i([X_0, X_1, ..., X_{i−1}])
A densely connected network typically comprises a plurality of dense modules, and one dense module consists of n dense layers. The specific structure of the basic dense layer is shown in FIG. 4. Unlike the common post-activation arrangement, the dense layer uses a pre-activation mechanism: a batch normalization layer and an activation function layer (ReLU) perform the activation operation before the convolution layer, after which a 3 × 3 convolution outputs the feature maps.
Assume the input X_0 of a dense module has m feature maps and each dense layer outputs k feature maps. According to the principle of the dense network, the input of the n-th dense layer contains m + (n−1) × k feature maps, so a direct 3 × 3 convolution brings a huge amount of computation. A bottleneck structure can be adopted to reduce the computation; the main method is to add 1×1 convolution layers in the original dense module to reduce the number of feature maps. In the bottleneck-structure dense layer built here, 2k feature maps are first obtained through a 1×1 convolution layer, and then k feature maps are output through a 3×3 convolution layer, as shown in FIG. 5.
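A minimal PyTorch sketch of such a bottleneck-structure dense layer, assuming the pre-activation order described above (batch normalization and ReLU before each convolution), might look as follows:

```python
# Bottleneck-structure dense layer: pre-activation, 1x1 conv to 2k maps, then 3x3 conv to k maps.
import torch
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    def __init__(self, in_channels, k):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 2 * k, kernel_size=1, bias=False),   # 1x1 -> 2k feature maps
            nn.BatchNorm2d(2 * k), nn.ReLU(inplace=True),
            nn.Conv2d(2 * k, k, kernel_size=3, padding=1, bias=False),  # 3x3 -> k feature maps
        )
    def forward(self, x):
        return self.layer(x)
```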
In order to balance detection speed and detection accuracy, the residual modules with outputs of 208×208 and 104×104 in the original darknet-53 network are retained, and the three groups of residual modules with outputs of 52×52, 26×26 and 13×13 are replaced by dense modules, each consisting of 4 bottleneck-structure dense layers; the network output dimensions remain consistent with the original darknet-53 network, as shown in FIG. 6.
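Continuing the sketch above (and using its imports and BottleneckDenseLayer), a dense module of 4 bottleneck-structure dense layers that concatenates all preceding feature maps could be written as below; the optional 1×1 transition convolution used to match the channel count of the replaced residual group is an assumption, since FIG. 6 is not reproduced here.

```python
# Dense module of 4 bottleneck dense layers; the transition conv is an assumed way
# to keep the output channel count consistent with the replaced darknet-53 group.
class DenseModule(nn.Module):
    def __init__(self, in_channels, k, num_layers=4, out_channels=None):
        super().__init__()
        self.layers = nn.ModuleList(
            [BottleneckDenseLayer(in_channels + i * k, k) for i in range(num_layers)]
        )
        concat_channels = in_channels + num_layers * k
        self.transition = (
            nn.Conv2d(concat_channels, out_channels, kernel_size=1, bias=False)
            if out_channels else nn.Identity()
        )
    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # X_i = H_i([X_0, ..., X_{i-1}])
        return self.transition(torch.cat(features, dim=1))
```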
S103, expanding the detection proportion of the YOLOV3 model, and optimizing a loss function of the YOLOV3 model by using a DIoU algorithm.
Specifically, for most convolutional neural networks, shallow features are needed to distinguish small targets and deep features are needed to distinguish large targets. YOLOV3 uses the multi-scale feature fusion idea of FPN to detect at three scales with feature map sizes of 13 × 13, 26 × 26 and 52 × 52, connecting adjacent scales by 2× up-sampling of the feature map. Since the pine cones to be detected are mostly small targets, the scale detection module in YOLOV3 is improved here. The original YOLOV3 detects at three scales in total: 32× down-sampling, 16× down-sampling and 8× down-sampling. The 4× down-sampled feature map in the network contains more fine-grained features and position information of small targets; fusing it with the high-level feature maps for detection can improve the precision of small-target detection.
The method for improving the scale detection module is shown in FIG. 7: the feature map used at the third detection scale in the original network is up-sampled and then concatenated with the output of the second group of residual modules in YOLOV3. In this way, a feature-fusion target detection layer at 4× down-sampling is established, and the three detection scales of the original YOLOV3 are expanded to four.
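A rough sketch of this fourth detection scale is given below; the channel counts, the 1×1 reduction convolution and the shape of the detection head are illustrative assumptions, with only the up-sample-and-concatenate pattern taken from the description.

```python
# Added fourth detection scale: up-sample the 8x branch and fuse it with the
# 4x-down-sampled backbone output (104x104 for a 416x416 input).
import torch
import torch.nn as nn

class FourthScaleHead(nn.Module):
    def __init__(self, deep_channels, shallow_channels, num_anchors=3, num_classes=1):
        super().__init__()
        self.reduce = nn.Conv2d(deep_channels, 128, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.head = nn.Sequential(
            nn.Conv2d(128 + shallow_channels, 256, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv2d(256, num_anchors * (5 + num_classes), kernel_size=1),
        )
    def forward(self, deep_feat, shallow_feat):
        x = self.upsample(self.reduce(deep_feat))   # e.g. 52x52 -> 104x104
        x = torch.cat([x, shallow_feat], dim=1)     # fuse with the 104x104 features
        return self.head(x)                         # 4x-down-sampled detection output

# e.g. deep_feat: (1, 256, 52, 52), shallow_feat: (1, 128, 104, 104)
out = FourthScaleHead(256, 128)(torch.randn(1, 256, 52, 52), torch.randn(1, 128, 104, 104))
```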
The loss function is used to evaluate the model. The loss function of YOLOV3 uses binary cross entropy and consists of three parts: coordinate error, confidence error and classification error:
loss = loss_coord + loss_noobj + loss_classes
The coordinate error comprises two parts: the error of the box center point and the error of the box width and height:

loss_coord = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²] + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
the confidence error consists of two parts, namely the confidence error when an object exists in the prediction frame and the confidence error when no object exists in the prediction frame:
Figure BDA0002663932600000072
the classification error is expressed as follows:
Figure BDA0002663932600000073
in the above loss function, λ represents a weight. The coordinate error is a large proportion of the total loss, λcoordSet to 5. In the confidence error, λ when an object is present in a frame is predicted to represent the difference between the presence and absence of the object obj1, predicting λ when there is no object in the framenoobj0.5, coefficient of the classification error term λclassesFixed to 1.
The confidence error in the YOLOV3 loss function is calculated based on IoU. IoU represents the intersection-over-union of the prediction box and the target box; when the prediction box is A and the target box is B:

IoU = |A ∩ B| / |A ∪ B|
IoU is widely used as an evaluation index in target detection tasks, but it has some disadvantages. If the prediction box and the target box do not intersect, IoU = 0 by definition, which cannot reflect the distance between the prediction box and the target box; at the same time, the position error and confidence error in the loss function cannot return a gradient, which affects the training of the network. When the intersection areas of the target box and the prediction box are equal but their distances differ, the calculated IoU values are equal, which cannot accurately reflect how well the two boxes coincide, and the performance of the network is also reduced. To solve this problem, GIoU is calculated from the minimum convex set of the prediction box and the target box; if the minimum convex set of A and B is C:

GIoU = IoU − |C \ (A ∪ B)| / |C|
when A and B are not coincident, the farther they are, the closer the GIoU approaches-1, so the loss function can be expressed using 1-GIoU, which better reflects the coincidence of A, B. But when a is within B, GIoU will be completely degraded to IoU. Aiming at the GIoU algorithm, a DIoU improvement method is also provided:
Figure BDA0002663932600000081
in the above loss function, bgtRepresents the center points of A and B, and p represents B and BgtC represents the diagonal distance of the smallest rectangle that can cover both a and B. DIoU can directly minimize the distance between a and B and therefore converges much faster than GIoU. The DIoU inherits the excellent characteristics of IoU and avoids the disadvantage of IoU, and is a good choice in the 2D/3D computer vision task based on IoU as an index, and the DIoU is introduced into the loss function of YOLOV3 to improve the detection accuracy.
The experimental scheme of the proposed work is shown in FIG. 8: data enhancement is performed on the collected raw data through the BEGAN network, the improved YOLOV3 model is then trained to convergence, and finally the visual detection results and evaluation indices of the model are examined.
The experimental model was built with PyTorch. The environment and parameters of model training are as follows: an Intel i7-8750H CPU, 16 GB of RAM, an Nvidia 1070 GPU and the Ubuntu 18.04 operating system. The training and testing sets were split 8:2, images were scaled to 416 × 416 before training, and the initialization parameters of the network are shown in Table 1:
TABLE 1 network initial parameters
[Table 1 appears only as an image in the original document.]
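The data preparation stated above (an 8:2 train/test split and resizing to 416 × 416) might be scripted roughly as follows; the file layout, image format and random seed are assumptions, and none of the values in Table 1 are used.

```python
# Hedged sketch of the 8:2 split and 416x416 resizing (file layout assumed).
import random
from pathlib import Path
from PIL import Image

def split_and_resize(image_dir, out_dir, ratio=0.8, size=(416, 416), seed=0):
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * ratio)
    for split, subset in (("train", paths[:n_train]), ("test", paths[n_train:])):
        dst = Path(out_dir) / split
        dst.mkdir(parents=True, exist_ok=True)
        for p in subset:
            Image.open(p).convert("RGB").resize(size).save(dst / p.name)
```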
To verify the performance of the proposed method, a comparative experiment was performed on the collected original data set with the improved YOLOV3, the original YOLOV3, SSD and Faster R-CNN. The F1 values and detection speeds of the four models are shown in Table 2, where the F1 value is taken as the maximum value and the detection speed is taken as the average speed over the entire data set.
TABLE 2 F1 values and detection speeds of the four models
[Table 2 appears only as an image in the original document.]
The improved YOLOV3 model achieves an F1 value of 0.923 and a detection speed of 7.9 ms, improvements of 1.31% and 38.2% respectively over the original YOLOV3, and is clearly better than SSD and Faster R-CNN. The P-R curves obtained from recall and precision during training are shown in FIG. 9; the P-R curve of the improved YOLOV3 model has a distinct advantage over the other models in that its balance point is closer to the coordinates (1, 1), indicating higher model performance.
Three improvements were made to the original YOLOV3 model, and to explore the effect of different improvements on the model, we performed several comparative experiments, with only one improvement added to each experiment.
After dense modules are introduced into the backbone network of YOLOV3, the computation of the model is greatly reduced: the computation of YOLOV3 with dense modules introduced is 40.48 BFLOPs, compared with 65.86 BFLOPs for the original YOLOV3, which is the main reason for the improvement in detection speed. As shown in FIG. 10, the AP value of the YOLOV3 model with dense modules is similar to that of the original YOLOV3 model, indicating that the detection accuracy is not greatly affected by the dense modules.
As shown in FIG. 11, the AP value of the model with only dense modules introduced is 91.8%, while the AP value of the model with both dense modules and the added detection scale is 92.9%, an improvement of 2.3%, because detection at 4 scales can accurately detect most small targets. The loss value of the model with the added detection scale begins to stabilize after 34,000 training steps, whereas the model with only dense modules introduced needs about 36,000 training steps to converge. Meanwhile, the final loss of the model with only dense modules is about 1.41, while the final loss of the model with the added detection scale is about 1.06, a reduction of 0.35; this shows that adding one detection scale gives the model a faster convergence speed and a better convergence result.
FIG. 12 shows the effect of DIoU loss on model accuracy, and using the DIoU loss, the AP value of the model increased from 92.7% to 93.4%, which is a 1.5% increase.
Introducing the dense modules improves the generalization capability of the model, allowing it to detect targets that the original YOLOV3 model cannot detect, while adding the detection scale and using the DIoU loss improve the detection precision to a certain extent. Under conditions of a small data set and complicated, variable data, the improved YOLOV3 model performs excellently in small-target detection and generalization.
To analytically validate the effectiveness of data enhancement using BEGAN, a comparison was made on the original image dataset and the augmented dataset using the modified YOLOV3 model. As shown in fig. 13, after data enhancement by BEGAN, the AP value of the model increased from 93.4% to 95.3%, which is a 2% increase. This shows that the diversity of the training data set is enriched by using the BEGAN to generate the image data, and the robustness of the detection model can be effectively enhanced.
To further analyze the effect of the size of the image data set on the model, the BEGAN-enhanced data set was used as a reference, and 400, 800 and 1200 images were randomly selected from it to form new data sets, on which the improved YOLOV3 model was trained to obtain the corresponding P-R curves, as shown in FIG. 14. The experimental results show that the data size has a great influence on the detection capability of the model: with a data set of 400 images, the detection capability of the model is very weak, and the detection performance gradually improves as the size of the training set increases. Furthermore, the improvement from data enhancement is limited; when the number of images in the training set exceeds 1200, the rate of improvement in model performance begins to slow as the number of images increases.
In the pine cone detection method based on the BEGAN and YOLOV3 models of the present invention, pine cone images are first collected at a plurality of different time nodes and data enhancement is performed with both traditional image enhancement techniques and the BEGAN deep learning method; the data-enhanced pine cone images are divided into a plurality of cells by a YOLOV3 model, and a densely connected network structure, modified with a constructed bottleneck-structure dense layer, is introduced into the YOLOV3 model; the detection scales of the YOLOV3 model are expanded, and the loss function of the YOLOV3 model is optimized with the DIoU algorithm, so that the overall performance of pine cone detection can be effectively improved.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A pine cone detection method based on BEGAN and YOLOV3 models is characterized by comprising the following steps:
collecting pine cone images under a plurality of different time nodes, and performing data enhancement by using a traditional image enhancement technology and a BEGAN deep learning method;
dividing the pine cone image after data enhancement by using a YOLOV3 model, and introducing a dense connection network structure in the YOLOV3 model;
expanding the detection proportion of the Yolov3 model, and optimizing a loss function of the Yolov3 model by using a DIoU algorithm.
2. The method for detecting pine cones based on the BEGAN and YOLOV3 models as claimed in claim 1, wherein the pine cone image after data enhancement is divided by using YOLOV3 model, and dense connection network structure is introduced into the YOLOV3 model, comprising:
dividing the input pine cone image into a plurality of cells by using a YOLOV3 model, and acquiring a plurality of bounding box information and 5 data values corresponding to the bounding box information by taking the cells with pine cones as units.
3. The pine cone detection method based on the BEGAN and YOLOV3 models as claimed in claim 2, wherein the pine cone image after data enhancement is divided by using a YOLOV3 model, and a dense connection network structure is introduced into the YOLOV3 model, further comprising:
and dividing 23 residual modules in a backbone network used by the YOLOV3 model into 5 groups, modifying a dense connection network structure by using the constructed bottleneck structure dense layer, and replacing any three groups of residual modules in the 5 groups of residual modules.
4. The method of claim 3, wherein the expanding the detection ratio of the YOLOV3 model and optimizing the loss function of the YOLOV3 model using DIoU algorithm comprises:
and respectively carrying out up-sampling on 32-time down-sampling, 16-time down-sampling and 8-time down-sampling, and then connecting the up-sampling with the output of the second group of residual error modules in the YOLOV3 model after the dense connection network structure is introduced to obtain a feature fusion target detection layer with 4-time down-sampling.
5. The method of claim 4, wherein the detection ratio of the YOLOV3 model is expanded and the loss function of the YOLOV3 model is optimized by a DIoU algorithm, and further comprising:
and constructing the coordinate error, the confidence error and the classification error into a loss function of the YOLOV3 model, and optimizing the loss function according to the Euclidean distance and the diagonal distance of the central point between the corresponding prediction frame and the target frame.
CN202010912858.0A 2020-10-29 2020-10-29 Pine cone detection method based on BEGAN and YOLOV3 models Withdrawn CN112149518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010912858.0A CN112149518A (en) 2020-10-29 2020-10-29 Pine cone detection method based on BEGAN and YOLOV3 models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010912858.0A CN112149518A (en) 2020-10-29 2020-10-29 Pine cone detection method based on BEGAN and YOLOV3 models

Publications (1)

Publication Number Publication Date
CN112149518A true CN112149518A (en) 2020-12-29

Family

ID=73889238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010912858.0A Withdrawn CN112149518A (en) 2020-10-29 2020-10-29 Pine cone detection method based on BEGAN and YOLOV3 models

Country Status (1)

Country Link
CN (1) CN112149518A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113329000A (en) * 2021-05-17 2021-08-31 山东大学 Privacy protection and safety monitoring integrated system based on smart home environment
CN114627143A (en) * 2021-10-12 2022-06-14 深圳宏芯宇电子股份有限公司 Image processing method and device, terminal equipment and readable storage medium


Similar Documents

Publication Publication Date Title
CN110263705B (en) Two-stage high-resolution remote sensing image change detection system oriented to remote sensing technical field
CN110135267B (en) Large-scene SAR image fine target detection method
CN110991311B (en) Target detection method based on dense connection deep network
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN111739075A (en) Deep network lung texture recognition method combining multi-scale attention
CN111368769B (en) Ship multi-target detection method based on improved anchor point frame generation model
CN109559297A (en) A method of generating the Lung neoplasm detection of network based on 3D region
CN114463637B (en) Winter wheat remote sensing identification analysis method and system based on deep learning
CN115223063B (en) Deep learning-based unmanned aerial vehicle remote sensing wheat new variety lodging area extraction method and system
CN110555841A (en) SAR image change detection method based on self-attention image fusion and DEC
CN112149518A (en) Pine cone detection method based on BEGAN and YOLOV3 models
CN115965862A (en) SAR ship target detection method based on mask network fusion image characteristics
CN116824585A (en) Aviation laser point cloud semantic segmentation method and device based on multistage context feature fusion network
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN114821341A (en) Remote sensing small target detection method based on double attention of FPN and PAN network
CN111222534A (en) Single-shot multi-frame detector optimization method based on bidirectional feature fusion and more balanced L1 loss
CN114743023B (en) Wheat spider image detection method based on RetinaNet model
CN115272819A (en) Small target detection method based on improved Faster-RCNN
CN116758363A (en) Weight self-adaption and task decoupling rotary target detector
CN115249329A (en) Apple leaf disease detection method based on deep learning
CN110363287B (en) Neural network design method for memory calculation and indoor presence or absence of people
Yang et al. Small aircraft target detection using cascade FP-CNN in remote sensing images
CN113158806A (en) OTD (optical time Domain _ Logistic) -based SAR (synthetic Aperture Radar) data ocean target detection method
CN114005001B (en) X-ray image detection method and system based on deep learning
Zhang et al. Detection of rotating small targets in remote sensing images based on improved yolov5s

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201229

WW01 Invention patent application withdrawn after publication