CN112668662B - Outdoor mountain forest environment target detection method based on improved YOLOv3 network

Info

Publication number: CN112668662B (granted publication of application CN112668662A)
Authority: CN (China)
Prior art keywords: target, loss, network, mountain forest, box
Legal status: Active (granted)
Application number: CN202011639547.8A
Other languages: Chinese (zh)
Inventors: 彭志红 (Peng Zhihong), 蒋卓 (Jiang Zhuo), 陈杰 (Chen Jie), 奚乐乐 (Xi Lele), 王星博 (Wang Xingbo)
Assignee: Beijing Institute of Technology (BIT)
Priority/filing date: 2020-12-31
Publication dates: 2021-04-16 (CN112668662A), 2022-12-06 (CN112668662B)


Abstract

The invention discloses an outdoor mountain forest environment target detection method based on an improved YOLOv3 network. The method comprises: constructing a target detection data set for the outdoor mountain forest environment; embedding an attention-based spatial transformation layer (STL) into the YOLOv3 detection model; adding a de-STL layer on this basis to facilitate model training; designing the resulting improved YOLOv3 detection network; and fine-tuning the network to obtain the target detection model finally used in the outdoor mountain forest environment. The invention can accurately detect targets in the outdoor mountain forest environment, and improves both the detector's precision and the recall rate of small-scale targets in that environment.

Description

Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Technical Field
The invention belongs to the technical field of computer vision and machine learning, and particularly relates to an outdoor mountain forest environment target detection method based on an improved YOLOv3 network.
Background
With the development of science and technology, target detection has become a popular direction and research focus in computer vision. Target detection technology is applied in many real scenes, such as autonomous driving, unmanned aerial vehicle surveillance and scene recognition, but target detection algorithms for the outdoor mountain forest environment still face many problems. The outdoor mountain forest environment is highly complex: illumination changes drastically, the climate varies irregularly, and occlusion between targets and non-targets is severe, all of which increase the difficulty of vision-based target detection; in addition, practical requirements demand real-time detection speed.
Traditional target detection algorithms rely primarily on the apparent characteristics of real objects. For objects with rich texture, hand-designed feature descriptors such as SIFT, PCA-SIFT and SURF extract strongly representative feature points from the image, which are then matched for detection. For objects with little or no texture, template matching is the preferred solution, and its core problem is designing a reasonable and universal distance metric. However, both approaches are easily affected by the environment, such as occlusion and cluttered backgrounds, and traditional methods are very sensitive to illumination variation and noise.
With the development of deep learning in recent years, more and more engineering work applies deep convolutional networks (CNNs) to practical problems. Deep-learning-based target detection requires no hand-designed feature descriptors; the network itself learns higher-level semantic information. CNNs are also more robust to factors such as illumination, scale change and noise, greatly improving model generalization and the accuracy of target detection algorithms.
The literature (Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 580-587.) first proposed a deep-learning-based target detection algorithm, a typical two-stage approach that detects via candidate-box extraction followed by classification and regression; another typical two-stage algorithm is Faster R-CNN (Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]// Advances in Neural Information Processing Systems. 2015: 91-99.). Although such algorithms achieve high precision, they cannot meet real-time speed requirements. To balance precision and speed, another class of one-stage target detection models based on the regression idea has received much attention, the most typical being the YOLO series (Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.) and the SSD series (Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector [C]// European Conference on Computer Vision. Springer, Cham, 2016: 21-37.). The SSD detection model discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios, fine-tunes the default boxes, and detects on feature maps of different scales; its accuracy is high, but its real-time performance is far below that of YOLO. The YOLO series offers very good real-time performance; the YOLOv3 model essentially balances speed and precision and greatly improves detection of small and multiple targets, so compared with SSD, YOLO is more widely applied owing to its efficiency.
Disclosure of Invention
In view of this, the invention provides an outdoor mountain forest environment target detection method based on an improved YOLOv3 network; its new network model can effectively improve the accuracy of target detection in the outdoor mountain forest environment.
The technical scheme for realizing the invention is as follows:
a field mountain forest environment target detection method based on an improved YOLOv3 network comprises the following steps:
acquiring a background picture and a foreground object of a field mountain forest, and presetting detection target types of people and vehicles;
secondly, superposing the background and the foreground through image preprocessing to generate a data set, acquiring bounding box and category data of a foreground object, and generating an xml file with the same format as the PASCALVOC2012 data set to obtain a training set, a verification set and a test set;
thirdly, based on a YOLOv3 network model, adding a spatial transformation layer STL behind feature extraction layers of different scales, training by taking feature maps of different scales as input to obtain different affine transformation results, and performing subsequent classification and bounding box regression on output features of the STL layer, wherein the method can effectively solve the problem of detection effect reduction when a target rotates and scales change in a field mountain forest environment;
fourthly, a de-STL conversion layer is added at last in the network, so that the final calculation result is matched with the encoding result of a true value bounding box which takes the feature map as the coordinate system before affine transformation, x and y corresponding to the original image coordinate system are obtained, and loss is calculated;
and step five, training the improved YOLOv3 network obtained in the step four by using the training set, the verification set and the test set obtained in the step two to obtain a performance optimal model.
Further, the first step specifically comprises the following process: using python3 + Beautiful Soup + requests + lxml, images are collected from the Internet for the four keywords forest, valley, plain and wetland (700, 600, 600 and 600 images respectively), and unsuitable images are manually removed; person (person, category_id = 1) and car (car, category_id = 3) data from the COCO2014 data set are selected as foreground objects, since that data set's objects are more everyday and contain more deformation and occlusion cases, and the foreground objects are randomly inserted into the background images to construct the outdoor mountain forest environment data set.
Further, the third step specifically comprises the following process: the attention-based STL layer design centers on the localization net, which outputs 6 parameters for an input feature map to apply an affine transformation to the original feature map, thereby alleviating the detection-accuracy drop caused by target rotation and scale change; the STL is therefore embedded after the conv26, conv43 and conv52 feature maps of Darknet-53, which keeps the information loss through the STL layer small while remaining sensitive to rotation changes, and the classification, regression and other losses are then computed on the output result.
Further, the fourth step specifically comprises the following process: the de-STL layer is embedded behind the convolution layer that outputs the image target position, so that the location loss can be computed; the location loss is:

$$Loss_{loc}=\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}+\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right]$$

where $\lambda_{coord}$ is the location loss coefficient, $S^{2}$ is the number of feature grid cells, and $B$ is the number of predicted boxes per cell; $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$, so that the $j$th bounding-box prediction is valid for the prediction, and 0 if no target exists in grid cell $i$; $x_{i}, y_{i}, w_{i}, h_{i}$ denote the predicted target position and the width and height output by the detection network, and $\hat{x}_{i}, \hat{y}_{i}, \hat{w}_{i}, \hat{h}_{i}$ the target position and width/height values in the data set.
Further, the fifth step is specifically:
(1) The YOLOv3 model has 9 anchors in total: 3 outputs of different scales, each using 3 anchors, so each output position predicts 3 boxes; for each box, the output parameters comprise the target position coordinates and width/height values, the confidence score that the box contains an object, and the probability of each object class in the box;
(2) The loss function for training the entire network is configured as follows:
$$Loss=\lambda_{1}Loss_{loc}+\lambda_{2}Loss_{conf}+\lambda_{3}Loss_{cls}$$

where $\lambda_{1}$ is the target position loss coefficient, $\lambda_{2}$ the target confidence loss coefficient, and $\lambda_{3}$ the target class loss coefficient;
the target confidence loss $Loss_{conf}$ uses binary cross entropy (Binary Cross Entropy), where $o_{i}\in\{0,1\}$ indicates whether a target actually exists in predicted target bounding box $i$ (0 for absent, 1 for present); $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$, so that the $j$th bounding-box prediction is valid for the prediction, and 0 if no target exists in grid cell $i$; conversely, $\mathbb{1}_{ij}^{noobj}$ is 0 if a target exists in grid cell $i$ and 1 if none exists; $\lambda_{noobj}$ is the loss coefficient for a grid cell containing no real target object, and $\lambda_{obj}$ the loss coefficient for a grid cell containing one; $\hat{c}_{i}$ denotes the sigmoid probability that an object exists in predicted target rectangular box $i$, and $c_{i}$ the true value; the target confidence loss $Loss_{conf}$ is:

$$Loss_{conf}=-\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\left(\lambda_{obj}\mathbb{1}_{ij}^{obj}+\lambda_{noobj}\mathbb{1}_{ij}^{noobj}\right)\left[o_{i}\ln\hat{c}_{i}+\left(1-o_{i}\right)\ln\left(1-\hat{c}_{i}\right)\right]$$
the target class loss $Loss_{cls}$ also uses binary cross entropy, where $o_{ij}\in\{0,1\}$ indicates whether a class-$j$ target really exists in predicted target bounding box $i$ (0 for absent, 1 for present) and $\hat{c}_{ij}$ denotes the sigmoid probability of a class-$j$ target in network-predicted target bounding box $i$; $Loss_{cls}$ is:

$$Loss_{cls}=-\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{j\in classes}\left[o_{ij}\ln\hat{c}_{ij}+\left(1-o_{ij}\right)\ln\left(1-\hat{c}_{ij}\right)\right]$$
(3) Training is carried out with the designed loss function using stochastic gradient descent (SGD), with Adam adopted as the gradient update scheme.
Beneficial effects:
When solving the target detection problem in the outdoor mountain forest environment, the invention introduces the attention-based STL spatial transformation layer after the feature input layers, so that the network automatically learns affine transformation parameters to handle the rotation and occlusion problems of targets at different scales; this improves small-target detection accuracy and effectively raises overall detection precision. The final results show that the method achieves a strong detection effect on the outdoor mountain forest data set.
Drawings
FIG. 1 is a data set produced by the present invention;
FIG. 2 is a network model structure of the present invention;
FIG. 3 is a detailed structure of the spatial transformation layer STL;
FIG. 4 is a specific structure of the de-STL layer;
FIG. 5 is a diagram showing the actual detection effect of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention mainly addresses the technical problem that existing target detection methods struggle in the outdoor mountain forest environment under severe conditions such as object occlusion, rotation and scale change. The core of the method is the YOLOv3 target detection algorithm, an efficient detector that balances precision and speed, built on a convolutional network whose specific structure is shown in FIG. 2. The invention can accurately detect targets in the outdoor mountain forest environment, improving the detector's precision and the recall rate of small-scale targets.
The detailed steps of the invention are as follows:
step 1: and acquiring a background picture and a foreground object of the wild mountain forest, and presetting detection types of people and vehicles. The invention uses python3+ Beautiful Soup + requests + lxml to respectively obtain four pictures of keywords of forest, valley, plain and wetland on the Internet, the number of the pictures is 700, 600, and the unsuitable pictures (such as pictures already containing foreground data and pictures with more watermarks) are manually rejected, and the average size of the pictures is [600,400] and the pictures are saved in a 'jpg' format.
Step 2: superpose the background pictures and foreground data through image preprocessing (the synthesized picture effect is shown in FIG. 1) to generate the data set; the script automatically acquires each object's bounding box and class data and generates an xml file in the same format as the PASCAL VOC2012 data set for subsequent training. When synthesizing a picture, the following rules are followed:
a) The COCO data set already contains many deformation and occlusion cases, so no extra randomness of this kind is added;
b) Scale changes are applied to the foreground data such that (larger foreground side / smaller background side) lies in the range 0.2 to 0.45;
c) To make the synthesized picture more realistic, require |foreground pixel mean - background pixel mean| < 30;
d) Each picture contains 3, 2 or 1 foreground instances with probability 0.2, 0.3 and 0.5 respectively, and each instance is equally likely to belong to the person or vehicle class;
e) The bounding box format is (x_min, y_min), (x_max, y_max).
In addition, attention must be paid to the picture coordinate-system conventions when writing the script (for example, cv2.imread treats the picture height as the x direction by default, cv2.rectangle treats the picture width as the x direction, and the xml file also treats the width as the x direction). 2500 pictures are synthesized, randomly split into a training set and a test set at a 4:1 ratio, stored following the PASCAL VOC2012 directory structure, and the data set is named Detection2500 for convenience of description; a minimal compositing sketch of the rules above follows.
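The sketch below realizes rules b) through e) with OpenCV and NumPy; foreground extraction from COCO and xml writing are omitted, and the helper names are illustrative assumptions rather than the patent's scripts.

```python
# Minimal sketch of the step-2 synthesis rules, under stated assumptions.
import random
import cv2
import numpy as np

def composite(background: np.ndarray, foreground: np.ndarray):
    """Paste one foreground crop onto the background per the rules above."""
    bg_h, bg_w = background.shape[:2]   # note: cv2 arrays are (height, width, channels)
    fg_h, fg_w = foreground.shape[:2]
    # Rule b): larger foreground side / smaller background side in [0.2, 0.45].
    ratio = random.uniform(0.2, 0.45)
    scale = ratio * min(bg_h, bg_w) / max(fg_h, fg_w)
    fg = cv2.resize(foreground, (max(1, int(fg_w * scale)), max(1, int(fg_h * scale))))
    # Rule c): |foreground pixel mean - background pixel mean| < 30 for realism.
    if abs(float(fg.mean()) - float(background.mean())) >= 30:
        return None, None  # caller retries with a different pair
    h, w = fg.shape[:2]
    x_min = random.randint(0, bg_w - w)
    y_min = random.randint(0, bg_h - h)
    out = background.copy()
    out[y_min:y_min + h, x_min:x_min + w] = fg
    # Rule e): bounding box as (x_min, y_min), (x_max, y_max).
    return out, (x_min, y_min, x_min + w, y_min + h)

# Rule d): 3 / 2 / 1 foreground instances with probability 0.2 / 0.3 / 0.5.
n_instances = random.choices([3, 2, 1], weights=[0.2, 0.3, 0.5], k=1)[0]
```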
Step 3: design the STL layer as shown in FIG. 3. The spatial transformer network has the following characteristics: (1) it can easily be embedded as a module in any network; (2) the parameters it must learn are those of the localization net; in other words, during training the network autonomously learns what affine transformation makes the output of the classifier or detector more confident. In addition, the target detection task differs from image classification: besides accurately knowing the target's class, a bounding box of the target must also be obtained. Taking this into account, placing the STL in the first layer, i.e., right behind the input image, is unreasonable for two reasons: a) training the STL may cause the input U to lose part of its information, and once features are extracted by the convolution layers, an appreciable number of features may be lost, especially when the input image contains multiple objects simultaneously; b) a major characteristic of the YOLOv3 structure is that feature maps of different scales are "responsible" for detecting targets of different scales, and the difference between the "proper" affine transformations required for targets with large scale differences is also larger than for targets of similar scale. It is therefore reasonable to train different affine transformations for feature maps of different scales, using the feature maps as input. The STL is thus embedded after the feature maps of different scales, and classification, bounding-box regression and other operations are performed on the STL's output. A minimal PyTorch sketch follows.
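The module below regresses the 6 affine parameters from a feature map and resamples it, as the text describes; the localization-net architecture is an assumption (the patent only fixes its role), and initializing to the identity transform is a common spatial-transformer convention rather than a stated requirement.

```python
# A hedged PyTorch sketch of the STL; architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STL(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Localization net: predicts the 2x3 affine matrix from the feature map.
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(channels * 16, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 6),          # the 6 affine parameters named in the text
        )
        # Start at the identity transform so early training leaves features unchanged.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x: torch.Tensor):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Return theta as well, so the de-STL step can invert the transform later.
        return F.grid_sample(x, grid, align_corners=False), theta
```

In the patent's design such a module sits after the conv26, conv43 and conv52 feature maps of Darknet-53, one instance per scale.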
Step 4: design the de-STL layer as shown in FIG. 4, where "de" is a negative prefix meaning "inverse"; de-STL is defined as the inverse operation of the STL. The reason for adding this operation is that after the STL, the output of the location part takes the affine-transformed feature map as its coordinate system, which obviously does not match the encoding of the ground-truth bounding box, whose coordinate system is the pre-transform feature map; the location output must therefore be inverse-transformed, i.e., passed through de-STL, to obtain x and y in the matching coordinate system, after which the loss is calculated and the network trained. The location loss $Loss_{loc}$ is:
$$Loss_{loc}=\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}+\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right]$$

where $\lambda_{coord}$ is the location loss coefficient, $S^{2}$ is the number of feature grid cells, and $B$ is the number of predicted boxes per cell; $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$, so that the $j$th bounding-box prediction is valid for the prediction, and 0 if no target exists in grid cell $i$; $x_{i}, y_{i}, w_{i}, h_{i}$ denote the predicted target position and the width and height output by the detection network, and $\hat{x}_{i}, \hat{y}_{i}, \hat{w}_{i}, \hat{h}_{i}$ the target position and width/height values in the data set. These are further calculated from the offsets of the default boxes output by the network. The target position in a data-set image is commonly represented by (x_min, y_min), (x_max, y_max), i.e., the coordinates of the bounding box's upper-left and lower-right corners. To use the above loss calculation and keep the translation and scaling invariance of the bounding box, it must be encoded, i.e., represented by its center-point coordinates and width/height. A short sketch of this encoding, and of the de-STL inverse mapping, follows.
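Inverting the 3 × 3 homogeneous extension of the STL's 2 × 3 affine matrix is one plausible realization of the inverse operation the patent describes, not its verbatim implementation; the helper names are illustrative.

```python
# Hedged sketch: box encoding plus a de-STL inverse mapping, under assumptions.
import torch

def encode_box(x_min, y_min, x_max, y_max):
    """Corner format -> (center x, center y, width, height)."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)

def de_stl(points: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Map predicted (x, y) centers from the affine-transformed feature map's
    coordinate system back to the pre-transform one.
    points: (N, 2) centers; theta: the (2, 3) affine matrix from the STL."""
    full = torch.cat([theta, torch.tensor([[0.0, 0.0, 1.0]])], dim=0)  # 3x3 homogeneous
    inv = torch.inverse(full)[:2]                                      # back to 2x3
    homog = torch.cat([points, torch.ones(points.size(0), 1)], dim=1)  # (N, 3)
    return homog @ inv.t()                                             # (N, 2)
```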
Step 5: the network training process is configured as follows:
(1) The YOLOv3 model has 9 anchors in total: 3 outputs of different scales, 3 anchors per output, so each output position predicts 3 boxes. For each box, the output parameters include x, y, w, h (further calculated from the default-box offsets output by the network), the confidence score that the box contains an object, and the probability of each object class in the box. Thus, for a VOC data set containing 20 classes, the YOLOv3 outputs have 3 sizes: 13 × 13 × (3 × (20 + 5)) = 13 × 13 × 75, 26 × 26 × (3 × (20 + 5)) = 26 × 26 × 75, and 52 × 52 × (3 × (20 + 5)) = 52 × 52 × 75 (see the quick check below).
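The three head sizes follow directly from 3 anchors × (classes + 5 box/confidence values):

```python
# Quick arithmetic check of the three YOLOv3 head sizes quoted above.
num_classes, anchors_per_scale = 20, 3
for grid in (13, 26, 52):
    depth = anchors_per_scale * (num_classes + 5)
    print(f"{grid} x {grid} x {depth}")   # 13 x 13 x 75, 26 x 26 x 75, 52 x 52 x 75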
(2) The loss function for training the entire network is configured as:
$$Loss=\lambda_{1}Loss_{loc}+\lambda_{2}Loss_{conf}+\lambda_{3}Loss_{cls}$$

where $\lambda_{1}$ is the target position loss coefficient, $\lambda_{2}$ the target confidence loss coefficient, and $\lambda_{3}$ the target class loss coefficient;
the target confidence loss $Loss_{conf}$ uses binary cross entropy (Binary Cross Entropy), where $o_{i}\in\{0,1\}$ indicates whether a target actually exists in predicted target bounding box $i$ (0 for absent, 1 for present); $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$, so that the $j$th bounding-box prediction is valid for the prediction, and 0 if no target exists in grid cell $i$; conversely, $\mathbb{1}_{ij}^{noobj}$ is 0 if a target exists in grid cell $i$ and 1 if none exists; $\lambda_{noobj}$ is the loss coefficient for a grid cell containing no real target object, and $\lambda_{obj}$ the loss coefficient for a grid cell containing one; $\hat{c}_{i}$ denotes the sigmoid probability that a target exists in predicted target rectangular box $i$, and $c_{i}$ the true value; the target confidence loss $Loss_{conf}$ is:

$$Loss_{conf}=-\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\left(\lambda_{obj}\mathbb{1}_{ij}^{obj}+\lambda_{noobj}\mathbb{1}_{ij}^{noobj}\right)\left[o_{i}\ln\hat{c}_{i}+\left(1-o_{i}\right)\ln\left(1-\hat{c}_{i}\right)\right]$$
the target class loss $Loss_{cls}$ also uses binary cross entropy, because the same object can belong to several categories at the same time (for example, a cat belongs to both "cat" and "animal"), which copes with more complex scenes; here $o_{ij}\in\{0,1\}$ indicates whether a class-$j$ target really exists in predicted target bounding box $i$ (0 for absent, 1 for present), $\hat{c}_{ij}$ denotes the sigmoid probability of a class-$j$ target in network-predicted target bounding box $i$, $S^{2}$ is the number of feature grid cells and $B$ the number of predicted boxes per cell; $Loss_{cls}$ is:

$$Loss_{cls}=-\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{j\in classes}\left[o_{ij}\ln\hat{c}_{ij}+\left(1-o_{ij}\right)\ln\left(1-\hat{c}_{ij}\right)\right]$$
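A hedged PyTorch sketch of the two binary-cross-entropy terms above; tensor layouts, the mask construction and the default coefficient values are assumptions for illustration.

```python
# Sketch of the confidence and class loss terms as binary cross-entropy.
import torch
import torch.nn.functional as F

def conf_cls_loss(pred_conf, pred_cls, true_conf, true_cls,
                  obj_mask, lambda_obj=1.0, lambda_noobj=0.5):
    """pred_conf/pred_cls are raw logits; true_conf is o_i and true_cls is o_ij
    (float 0/1 tensors); obj_mask (bool) marks boxes responsible for a target."""
    bce = F.binary_cross_entropy_with_logits
    # Loss_conf: weight responsible boxes by lambda_obj, the rest by lambda_noobj.
    weights = torch.where(obj_mask,
                          lambda_obj * torch.ones_like(pred_conf),
                          lambda_noobj * torch.ones_like(pred_conf))
    loss_conf = (weights * bce(pred_conf, true_conf, reduction="none")).sum()
    # Loss_cls: independent sigmoid per class, summed only where a target exists,
    # which is what lets one object carry several labels at once.
    loss_cls = bce(pred_cls[obj_mask], true_cls[obj_mask], reduction="sum")
    return loss_conf, loss_cls
```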
(3) Training is carried out with the designed loss function using stochastic gradient descent (SGD), with Adam adopted as the gradient update scheme.
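Since the text names both SGD and Adam, a minimal Adam-based update loop is sketched here; `model`, `train_loader`, the `compute_losses` helper and the coefficient values are assumptions, not the patent's configuration.

```python
# Hedged sketch of the training loop, under the stated assumptions.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
lambda1, lambda2, lambda3 = 1.0, 1.0, 1.0  # loss weights; values are assumptions

for images, targets in train_loader:
    optimizer.zero_grad()
    # Hypothetical helper returning Loss_loc, Loss_conf and Loss_cls.
    loss_loc, loss_conf, loss_cls = model.compute_losses(images, targets)
    loss = lambda1 * loss_loc + lambda2 * loss_conf + lambda3 * loss_cls
    loss.backward()
    optimizer.step()
```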
Compared with other detection methods, the proposed model achieves higher detection precision and performance on the outdoor mountain forest environment target detection task, including higher accuracy on small targets; the specific detection effect is shown in FIG. 5.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An outdoor mountain forest environment target detection method based on an improved YOLOv3 network, characterized by comprising the following steps:
step one, acquiring background pictures and foreground objects of the outdoor mountain forest, and presetting the detection target classes person and vehicle;
step two, superposing the background and the foreground through image preprocessing to generate a data set, acquiring the bounding box and class data of each foreground object, and generating an xml file in the same format as the PASCAL VOC2012 data set, thereby obtaining a training set, a verification set and a test set;
step three, based on the YOLOv3 network model, adding a spatial transformation layer STL behind the feature extraction layers of different scales, training with the feature maps of different scales as input to obtain different affine transformation results, and then performing the subsequent classification and bounding-box regression on the STL layer's output features;
step four, adding an inverse spatial transformation layer de-STL at the end of the network so that the final calculation result matches the encoding of the ground-truth bounding box, which takes the pre-affine-transform feature map as its coordinate system, thereby obtaining x and y in the original image coordinate system, and then calculating the location loss, the target confidence loss and the target class loss to obtain the improved YOLOv3 network model;
step five, training the improved YOLOv3 network obtained in step four with the training, verification and test sets obtained in step two to obtain the best-performing model, and performing target detection with that model.
2. The method for detecting targets in the outdoor mountain forest environment based on the improved YOLOv3 network according to claim 1, characterized in that the first step specifically comprises the following process: using python3 + Beautiful Soup + requests + lxml, pictures are collected from the Internet for the four keywords forest, valley, plain and wetland (700, 600, 600 and 600 pictures respectively), and unsuitable pictures are manually removed; person and vehicle data from the COCO2014 data set are selected as foreground objects and randomly inserted into the background images to construct the outdoor mountain forest environment data set.
3. The method for detecting targets in the outdoor mountain forest environment based on the improved YOLOv3 network according to claim 1, characterized in that the third step specifically comprises the following process: the attention-based STL layer design centers on the localization net, which outputs 6 parameters for an input feature map to apply an affine transformation to the original feature map, thereby alleviating the detection-accuracy drop caused by target rotation and scale change; the STL is therefore embedded after the conv26, conv43 and conv52 feature maps of Darknet-53, which keeps the information loss through the STL layer small while remaining sensitive to rotation changes, and the classification and regression losses are then computed on the output result.
4. The method for detecting targets in the outdoor mountain forest environment based on the improved YOLOv3 network according to claim 1, characterized in that the fourth step specifically comprises the following process: the de-STL layer is embedded behind the convolution layer that outputs the image target position, so that the location loss can be computed; the location loss is:

$$Loss_{loc}=\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}+\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right]$$

where $\lambda_{coord}$ is the location loss coefficient, $S^{2}$ is the number of feature grid cells, and $B$ is the number of predicted boxes per cell; $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$, so that the $j$th bounding-box prediction is valid for the prediction, and 0 if no target exists in grid cell $i$; $x_{i}, y_{i}, w_{i}, h_{i}$ denote the predicted target position and the width and height output by the detection network, and $\hat{x}_{i}, \hat{y}_{i}, \hat{w}_{i}, \hat{h}_{i}$ the target position and width/height values in the data set.
5. The method for detecting targets in the outdoor mountain forest environment based on the improved YOLOv3 network according to claim 1, characterized in that the fifth step is specifically:
(1) The YOLOv3 model has 9 anchors in total: 3 outputs of different scales, each using 3 anchors, so each output position predicts 3 boxes; for each box, the output parameters comprise the target position coordinates and width/height values, the confidence score that the box contains an object, and the probability of each object class in the box;
(2) The loss function for training the entire network is configured as follows:

$$Loss=\lambda_{1}Loss_{loc}+\lambda_{2}Loss_{conf}+\lambda_{3}Loss_{cls}$$

where $\lambda_{1}$ is the target position loss coefficient, $\lambda_{2}$ the target confidence loss coefficient, and $\lambda_{3}$ the target class loss coefficient;
the target confidence loss $Loss_{conf}$ uses binary cross entropy, where $o_{i}\in\{0,1\}$ indicates whether a target actually exists in predicted target bounding box $i$ (0 for absent, 1 for present); $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$, so that the $j$th bounding-box prediction is valid for the prediction, and 0 if no target exists in grid cell $i$; conversely, $\mathbb{1}_{ij}^{noobj}$ is 0 if a target exists in grid cell $i$ and 1 if none exists; $\lambda_{noobj}$ is the loss coefficient for a grid cell containing no real target object, and $\lambda_{obj}$ the loss coefficient for a grid cell containing one; $\hat{c}_{i}$ denotes the sigmoid probability that an object exists in predicted target rectangular box $i$, and $c_{i}$ the true value; the target confidence loss $Loss_{conf}$ is:

$$Loss_{conf}=-\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\left(\lambda_{obj}\mathbb{1}_{ij}^{obj}+\lambda_{noobj}\mathbb{1}_{ij}^{noobj}\right)\left[o_{i}\ln\hat{c}_{i}+\left(1-o_{i}\right)\ln\left(1-\hat{c}_{i}\right)\right]$$

the target class loss $Loss_{cls}$ also uses binary cross entropy, where $o_{ij}\in\{0,1\}$ indicates whether a class-$j$ target really exists in predicted target bounding box $i$ (0 for absent, 1 for present) and $\hat{c}_{ij}$ denotes the sigmoid probability of a class-$j$ target in network-predicted target bounding box $i$; $Loss_{cls}$ is:

$$Loss_{cls}=-\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{j\in classes}\left[o_{ij}\ln\hat{c}_{ij}+\left(1-o_{ij}\right)\ln\left(1-\hat{c}_{ij}\right)\right]$$

(3) Training is carried out with the designed loss function using stochastic gradient descent (SGD), with Adam adopted as the gradient update scheme.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant