CN112668662A - Outdoor mountain forest environment target detection method based on improved YOLOv3 network - Google Patents


Info

Publication number
CN112668662A
Authority
CN
China
Prior art keywords
target
loss
network
mountain forest
box
Prior art date
Legal status
Granted
Application number
CN202011639547.8A
Other languages
Chinese (zh)
Other versions
CN112668662B (en)
Inventor
彭志红
蒋卓
陈杰
奚乐乐
王星博
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology (BIT)
Priority to CN202011639547.8A
Publication of CN112668662A
Application granted
Publication of CN112668662B
Legal status: Active, Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an outdoor mountain forest environment target detection method based on an improved YOLOv3 network. The method comprises: constructing a target detection data set for the outdoor mountain forest environment; introducing a spatial transformation layer (STL) based on an attention mechanism and embedding it into the YOLOv3 detection model; adding a de-STL layer on this basis to facilitate model training; designing the improved YOLOv3 detection network; and fine-tuning the network to obtain the target detection model finally used in the outdoor mountain forest environment. The invention can accurately detect targets in the outdoor mountain forest environment, and improves both the detection precision of the detector in this environment and the recall rate for small-scale targets.

Description

Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Technical Field
The invention belongs to the technical fields of computer vision and machine learning, and particularly relates to an outdoor mountain forest environment target detection method based on an improved YOLOv3 network.
Background
With the development of science and technology, target detection has become a popular direction and research focus in the field of computer vision. Target detection technology is applied in many practical scenarios, such as autonomous driving, unmanned aerial vehicle surveillance and scene recognition, but target detection in the outdoor mountain forest environment still faces many problems. The environment is highly complex: illumination changes drastically, the weather changes irregularly, and occlusion between targets and non-targets is severe, all of which increase the difficulty of vision-based target detection; in addition, practical requirements demand that the detection speed be real-time.
Traditional target detection algorithms rely primarily on the apparent characteristics of real objects. For objects with rich texture, hand-designed feature descriptors such as SIFT, PCA-SIFT and SURF are used to extract strongly representative feature points from the image for subsequent matching and detection. For objects with little or even no texture, template matching is the preferred solution; its core problem is designing a reasonable and general distance metric. However, both approaches are easily affected by the environment, such as occlusion and cluttered backgrounds, and the traditional methods are very sensitive to illumination variation and noise.
With the development of deep learning in recent years, more and more engineering applications use deep convolutional neural networks (CNNs) to solve practical problems. A deep-learning target detection method requires no hand-designed feature descriptors; instead, the network learns higher-level semantic information by itself. Deep CNNs are also more robust to factors such as illumination, scale change and noise, which greatly improves the generalization ability of the model and the precision of the target detection algorithm.
The literature (Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 580-587.) first proposed a deep-learning target detection algorithm of the typical Two-Stage type, which detects mainly through candidate-box extraction followed by classification and regression; another typical algorithm of this type is Faster R-CNN, proposed in the literature (Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]// Advances in Neural Information Processing Systems. 2015: 91-99.). Although this series of algorithms achieves high detection accuracy, the complexity of the computation keeps the network speed far from the requirement of real-time performance. To balance the demands of precision and speed, another class of One-Stage target detection models based on the regression idea has received much attention, among which are the YOLO series (Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.) and the SSD series (Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector [C]// European Conference on Computer Vision. Springer, Cham, 2016: 21-37.). The SSD detection model discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios, fine-tunes the default boxes, and detects on feature maps of different scales; its accuracy is high, but its real-time performance is far inferior to that of YOLO. The real-time performance of the YOLO series is very good; the YOLOv3 model essentially balances speed and precision and greatly improves the detection of small and multiple targets, and owing to this efficiency the YOLO series is more widely applied than SSD.
Disclosure of Invention
In view of this, the invention provides a method for detecting targets in an outdoor mountain forest environment based on an improved YOLOv3 network; the new network model can effectively improve the accuracy of target detection in this environment.
The technical scheme for realizing the invention is as follows:
a field mountain forest environment target detection method based on an improved YOLOv3 network comprises the following steps:
acquiring a background picture and a foreground object of a field mountain forest, and presetting detection target types of people and vehicles;
secondly, superposing the background and the foreground through image preprocessing to generate a data set, acquiring bounding box and category data of a foreground object, and generating an xml file with the same format as the PASCALVOC2012 data set to obtain a training set, a verification set and a test set;
thirdly, based on a YOLOv3 network model, adding a spatial transformation layer STL behind feature extraction layers with different scales, training by taking feature maps with different scales as input to obtain different affine transformation results, and performing subsequent classification and bounding box regression on output features of the STL layer, wherein the method can effectively solve the problem of detection effect reduction when a target rotates and scales change in a field mountain forest environment;
fourthly, a de-STL conversion layer is added at last in the network, so that the final calculation result is matched with the encoding result of a true value bounding box which takes the feature map as the coordinate system before affine transformation, x and y corresponding to the original image coordinate system are obtained, and loss is calculated;
and step five, training the improved YOLOv3 network obtained in the step four by using the training set, the verification set and the test set obtained in the step two to obtain a performance optimal model.
Further, step one specifically includes the following process: using python3, Beautiful Soup, requests and lxml, pictures for the four keywords forest, valley, plain and wetland are crawled from the Internet, 700, 600, 600 and 600 respectively, and unsuitable pictures are removed manually; the data of people (person, category_id = 1) and cars (car, category_id = 3) in the COCO2014 data set are selected as foreground objects, since the objects they contain are common in daily life with frequent deformation and occlusion, and the foreground objects are randomly inserted into the background pictures to construct the outdoor mountain forest environment data set.
Further, step three specifically includes the following process: the design of the attention-based STL layer centers on the localization net, which outputs 6 parameters for an input feature map and applies an affine transformation to the original feature map, thereby alleviating the drop in detection accuracy caused by target rotation and scale change; the STL is therefore embedded after the conv26, conv43 and conv52 feature maps of Darknet-53, where little information is lost in passing through the STL layer while sensitivity to rotation change is preserved, and the classification, regression and other losses are then calculated on the output result.
Further, step four specifically includes the following process: the de-STL layer is embedded behind the convolution layer that outputs the image target position, so that the location loss can be calculated conveniently; the location loss is:

$$Loss_{loc} = \lambda_{coord}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

where $\lambda_{coord}$ is the loss coefficient, $S$ is the number of feature grid cells, $B$ is the number of predicted boxes, and $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$ and the $j$-th bounding-box prediction is responsible for it, and 0 if no target exists in grid cell $i$; $(x_i, y_i, w_i, h_i)$ are the target position and width-height values predicted by the detection network, and $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ are the target position and width-height values in the data set.
Further, the fifth step is specifically:
(1) the YOLOv3 model has a total of 9 anchors and 3 outputs of different scales, each output using 3 anchors, so each position of an output predicts 3 boxes; for each box, the output parameters comprise the target position coordinates and width-height values, the confidence score that the box contains an object, and the probability of each object class in the box;
(2) the loss function for training the entire network is configured as follows:
$$Loss = \lambda_1 Loss_{loc} + \lambda_2 Loss_{conf} + \lambda_3 Loss_{cls}$$

where $\lambda_1$ is the target position loss coefficient, $\lambda_2$ the target confidence loss coefficient, and $\lambda_3$ the target class loss coefficient;
the target confidence loss $Loss_{conf}$ uses binary cross entropy, where $c_i \in \{0,1\}$ indicates whether a target really exists in predicted target bounding box $i$, 0 meaning absent and 1 meaning present; $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$ and the $j$-th bounding-box prediction is responsible for it, and 0 if no target exists in grid cell $i$; conversely, $\mathbb{1}_{ij}^{noobj}$ is 0 when a target exists in grid cell $i$ and 1 when none exists; $\lambda_{noobj}$ is the loss coefficient for grids containing no real target, and $\lambda_{obj}$ is the loss coefficient for grids containing a real target; $\hat{c}_i$ denotes the sigmoid probability that an object is present in predicted rectangular box $i$, and $c_i$ denotes the true value. The formula of the target confidence loss $Loss_{conf}$ is:

$$Loss_{conf} = -\lambda_{obj}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[c_i\ln\hat{c}_i+(1-c_i)\ln(1-\hat{c}_i)\right]-\lambda_{noobj}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left[c_i\ln\hat{c}_i+(1-c_i)\ln(1-\hat{c}_i)\right]$$
the target class loss $Loss_{cls}$ also uses binary cross entropy, where $o_{ij} \in \{0,1\}$ indicates whether a target of the $j$-th class really exists in predicted target bounding box $i$, 0 meaning absent and 1 meaning present, and $\hat{o}_{ij}$ denotes the sigmoid probability predicted by the network that a target of the $j$-th class is in bounding box $i$; the formula of $Loss_{cls}$ is:

$$Loss_{cls} = -\sum_{i=0}^{S}\mathbb{1}_{i}^{obj}\sum_{j \in classes}\left[o_{ij}\ln\hat{o}_{ij}+(1-o_{ij})\ln(1-\hat{o}_{ij})\right]$$
(3) training is carried out with the designed loss function using stochastic gradient descent (SGD), and Adam is adopted as the gradient update method.
Advantageous effects:
when the method is used for solving the target detection problem in the field mountain forest environment, the attention mechanism-based STL space conversion layer is introduced and connected to the characteristic input layer, so that the network can automatically learn affine transformation parameters to solve the rotation problem and the shielding problem of targets with different scales, the accuracy of small target detection can be improved, and the target detection precision is effectively improved. The final result shows that the method has high target detection effect on the wild mountain forest data set.
Drawings
FIG. 1 shows the data set produced by the present invention;
FIG. 2 shows the network model structure of the present invention;
FIG. 3 shows the detailed structure of the spatial transformation layer STL;
FIG. 4 shows the detailed structure of the de-STL layer;
FIG. 5 shows the actual detection effect of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention mainly addresses the technical problem that existing target detection methods struggle to cope with the outdoor mountain forest environment under severe conditions such as object occlusion, rotation and scale change. The core of the method is the YOLOv3 target detection algorithm, an efficient detector that balances precision and speed and adopts a convolutional network structure, shown in FIG. 2. The invention can accurately detect targets in the outdoor mountain forest environment, and improves both the detection precision of the detector in this environment and the recall rate of small-scale targets.
The detailed steps of the invention are as follows:
step 1: and acquiring a background picture and a foreground object of the wild mountain forest, and presetting detection types of people and vehicles. The invention uses python3+ Beautiful Soup + requests + lxml to crawl four pictures of "keywords" on the Internet, 700, 600 respectively, and manually eliminates the data of unsuitable pictures (such as pictures already containing foreground data and pictures with more watermarks), statistically, the average size of the pictures is [600,400] and the pictures are all stored in 'jpg' format, selects the data of people (person, category _ id 1) and cars (car, category _ id 3) in COCO2014 as foreground objects, on one hand, because the accurate segmentation calibration allows the extraction of the foreground objects, on the other hand, because the objects contained are more frequent, deformation and occlusion conditions are more favorable for increasing the robustness of the detector, and in order to extract the objects more completely, sets the extraction script to extract 2 picture and 1352 picture in total of the outline of people and cars, besides foreground data, pixel values of other positions in each picture are 0, and the picture can be conveniently superposed with a background picture.
Step 2: superimpose the background pictures and the foreground data through image preprocessing; the effect of a synthesized picture is shown in FIG. 1. A data set is generated, the bounding box and category data of each object are automatically acquired in the script, and xml files in the same format as the PASCAL VOC 2012 data set are generated for subsequent training. When composing a picture, the following rules are followed:
a) many deformation and occlusion situations already exist in the COCO data set, so no additional randomness of this kind is added;
b) scale changes are applied to the foreground data such that the ratio (longer side of the foreground / shorter side of the background) lies in the range 0.2 to 0.45;
c) to make the synthesized picture more realistic, |foreground pixel mean − background pixel mean| < 30 is required;
d) each picture contains 3, 2 or 1 foreground instances with probabilities 0.2, 0.3 and 0.5 respectively, and each instance belongs to the person or car class with equal probability;
e) the bounding box format is (x_min, y_min), (x_max, y_max).
In addition, attention must be paid to the direction of the picture coordinate axes when writing the script (for example, cv2.imread by default treats the picture height as the x direction, cv2.rectangle treats the picture width as the x direction, and in the xml file the picture width is also the x direction). 2500 pictures are synthesized in total and randomly divided into a training set and a test set at a ratio of 4:1, stored according to the directory structure of PASCAL VOC 2012; for convenience of description, the constructed data set is named Detection2500.
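The compositing itself can be sketched as follows; this is only an illustrative reading of rules b), c) and e) (rule d), the number of instances per picture, would be handled by the caller), and the function name and retry convention are hypothetical.

```python
import random
import cv2

def composite(bg, fg):
    """Paste one zero-background foreground onto a background picture.
    Returns the synthesized picture and its (x_min, y_min, x_max, y_max) box,
    or None if rule c) rejects the pair (the caller retries with another pair)."""
    # Rule b): rescale so (longer side of fg) / (shorter side of bg) is in [0.2, 0.45].
    ratio = random.uniform(0.2, 0.45)
    scale = ratio * min(bg.shape[:2]) / max(fg.shape[:2])
    fg = cv2.resize(fg, None, fx=scale, fy=scale)

    mask = fg.sum(axis=2) > 0                         # object pixels are non-zero
    # Rule c): keep foreground and background brightness close.
    if abs(float(fg[mask].mean()) - float(bg.mean())) >= 30:
        return None

    # Random placement; note that OpenCV arrays are indexed (row = y, col = x).
    h, w = fg.shape[:2]
    y0 = random.randint(0, bg.shape[0] - h)
    x0 = random.randint(0, bg.shape[1] - w)
    out = bg.copy()
    out[y0:y0 + h, x0:x0 + w][mask] = fg[mask]

    # Rule e): bounding box in (x_min, y_min), (x_max, y_max) format.
    return out, (x0, y0, x0 + w, y0 + h)
```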
Step 3: design the STL layer as shown in FIG. 3. The spatial transformer network has the following characteristics:
(1) the network can easily be embedded as a module in any network; (2) the parameters the network needs to learn are those of the localization net. In other words, during training the network autonomously learns which affine transformations make the output of the classifier or detector more confident. In addition, the target detection task differs from the image classification task: besides accurately knowing the type of the target, a bounding box of the target must also be obtained. In view of this, it is unreasonable to place the STL at the first layer, i.e. directly after the input image, for two reasons: a) training the STL may cause the input U to lose part of its information, and if features were then extracted by the convolution layers, an appreciable number of features could be lost, especially when the input image contains multiple objects at the same time; b) one of the notable features of the YOLOv3 structure is that feature maps of different scales are "responsible" for detecting objects of different scales, and the "proper" affine transformations required by objects with large scale differences differ more than those required by objects of similar scale. It is therefore reasonable to train a different affine transformation for each scale, taking the feature map of that scale as input. The STL is thus embedded after the feature maps of different scales, and classification, bounding-box regression and other operations are then performed on the STL output.
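The patent does not give the STL in code form; a minimal PyTorch sketch, assuming the standard spatial-transformer formulation (localization net → 6 affine parameters → grid sampling), is shown below. The localization-net layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STL(nn.Module):
    """Spatial transformation layer: learns one 2x3 affine matrix per image
    and warps the input feature map with it."""
    def __init__(self, channels):
        super().__init__()
        # Localization net: regresses the 6 affine parameters theta.
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 6),
        )
        # Start at the identity transform so early training matches plain YOLOv3.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)            # (N, 2, 3) affine matrices
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False), theta

# e.g. for the 26x26 scale: warped, theta = STL(256)(feature_map)
```

Returning theta alongside the warped features matters here, because the de-STL step described next needs the same affine parameters in order to invert the transform.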
Step 4: design the de-STL layer as shown in FIG. 4, where "de" is a negating prefix meaning "inverse"; de-STL is defined as the inverse operation of the STL. The reason for adding this operation is that after the STL, the output of the location part takes the affine-transformed feature map as its coordinate system, which obviously does not match the encoding of the ground-truth bounding box, whose coordinate system is the feature map before the affine transformation; the output of the location part therefore needs to be inverse-transformed, i.e. passed through the de-STL, to obtain x and y in the matching coordinate system, after which the loss is calculated and the network trained further. Here $Loss_{loc}$ is:

$$Loss_{loc} = \lambda_{coord}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

where $\lambda_{coord}$ is the loss coefficient, $S$ is the number of feature grid cells, $B$ is the number of predicted boxes, and $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$ and the $j$-th bounding-box prediction is responsible for it, and 0 if no target exists in grid cell $i$; $(x_i, y_i, w_i, h_i)$ are the target position and width-height values predicted by the detection network, and $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ are the target position and width-height values in the data set. The predicted values are further calculated from the offsets relative to the default boxes output by the network. The target position in an image is commonly represented in data sets as (x_min, y_min), (x_max, y_max), i.e. the position coordinates of the upper-left and lower-right corners of the bounding box. To use the above loss calculation while maintaining the translation and scaling invariance of the bounding box, the box must be encoded, i.e. represented by its center-point coordinates and its width and height.
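As a sketch only (the patent gives no code for the de-STL), the inverse transform can be realized by inverting the learned affine matrix, and the corner-to-center encoding is a two-line helper; all names are hypothetical.

```python
import torch

def de_stl(points, theta):
    """Map (x, y) points from the affine-transformed feature map back to the
    coordinate system of the feature map before the transform."""
    n = theta.size(0)                                  # theta: (N, 2, 3)
    bottom = torch.tensor([0.0, 0.0, 1.0]).expand(n, 1, 3)
    inv = torch.inverse(torch.cat([theta, bottom], 1))[:, :2, :]      # (N, 2, 3)
    homo = torch.cat([points, torch.ones_like(points[..., :1])], -1)  # (N, P, 3)
    return homo @ inv.transpose(1, 2)                  # (N, P, 2)

def encode_box(x_min, y_min, x_max, y_max):
    """Corner format -> (center_x, center_y, width, height)."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2,
            x_max - x_min, y_max - y_min)
```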
Step 5: the network training process is configured as follows:
(1) There are a total of 9 anchors in the YOLOv3 model and 3 outputs of different scales, 3 anchors for each output, so each position of an output predicts 3 boxes. For each box, the output parameters include x, y, w, h, which are further calculated from the offsets relative to the default box output by the network, together with the confidence score that the box contains an object and the probability of each object class in the box. Thus, for a VOC data set containing 20 categories, the output of YOLOv3 has 3 sizes: 13 × 13 × (3 × (20 + 5)) = 13 × 13 × 75, 26 × 26 × (3 × (20 + 5)) = 26 × 26 × 75, and 52 × 52 × (3 × (20 + 5)) = 52 × 52 × 75.
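The per-scale tensor depth follows directly from this layout; a quick check, assuming 20 classes and 3 anchors per scale:

```python
num_classes, anchors_per_scale = 20, 3
depth = anchors_per_scale * (num_classes + 5)   # 5 = 4 box values + 1 confidence
for grid in (13, 26, 52):
    print(f"{grid} x {grid} x {depth}")         # 13 x 13 x 75, 26 x 26 x 75, 52 x 52 x 75
```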
(2) The loss function for training the entire network is configured as follows:
$$Loss = \lambda_1 Loss_{loc} + \lambda_2 Loss_{conf} + \lambda_3 Loss_{cls}$$

where $\lambda_1$ is the target position loss coefficient, $\lambda_2$ the target confidence loss coefficient, and $\lambda_3$ the target class loss coefficient;
the target confidence loss $Loss_{conf}$ uses binary cross entropy, where $c_i \in \{0,1\}$ indicates whether a target really exists in predicted target bounding box $i$, 0 meaning absent and 1 meaning present; $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$ and the $j$-th bounding-box prediction is responsible for it, and 0 if no target exists in grid cell $i$; conversely, $\mathbb{1}_{ij}^{noobj}$ is 0 when a target exists in grid cell $i$ and 1 when none exists; $\lambda_{noobj}$ is the loss coefficient for grids containing no real target, and $\lambda_{obj}$ is the loss coefficient for grids containing a real target; $\hat{c}_i$ denotes the sigmoid probability that an object is present in predicted rectangular box $i$, and $c_i$ denotes the true value. The formula of the target confidence loss $Loss_{conf}$ is:

$$Loss_{conf} = -\lambda_{obj}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[c_i\ln\hat{c}_i+(1-c_i)\ln(1-\hat{c}_i)\right]-\lambda_{noobj}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left[c_i\ln\hat{c}_i+(1-c_i)\ln(1-\hat{c}_i)\right]$$
the target class loss $Loss_{cls}$ also uses binary cross entropy; the reason is that the same target can belong to several classes at the same time (for example, a cat can be classified both as a cat and as an animal), so the method can cope with more complex scenes. Here $o_{ij} \in \{0,1\}$ indicates whether a target of the $j$-th class really exists in predicted target bounding box $i$, 0 meaning absent and 1 meaning present, $\hat{o}_{ij}$ denotes the sigmoid probability predicted by the network that a target of the $j$-th class is in bounding box $i$, $S$ is the number of feature grid cells, and $B$ is the number of predicted boxes; the formula of $Loss_{cls}$ is:

$$Loss_{cls} = -\sum_{i=0}^{S}\mathbb{1}_{i}^{obj}\sum_{j \in classes}\left[o_{ij}\ln\hat{o}_{ij}+(1-o_{ij})\ln(1-\hat{o}_{ij})\right]$$
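Pulling the three terms together, a compact PyTorch sketch under the definitions above might look as follows; the flattened tensor layout (one row per predicted box: x, y, w, h, confidence logit, class logits) and all names are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, truth, obj_mask, noobj_mask,
               l1=1.0, l2=1.0, l3=1.0,
               lam_coord=5.0, lam_obj=1.0, lam_noobj=0.5):
    """pred/truth: (N, 4 + 1 + C) per box = x, y, w, h, confidence, classes.
    obj_mask / noobj_mask: boolean masks realizing the indicator functions."""
    # Loss_loc: squared error on (x, y, w, h) for responsible boxes only.
    loss_loc = lam_coord * ((pred[obj_mask, :4] - truth[obj_mask, :4]) ** 2).sum()

    # Loss_conf: binary cross entropy, split into object / no-object terms.
    bce = lambda m: F.binary_cross_entropy(
        torch.sigmoid(pred[m, 4]), truth[m, 4], reduction="sum")
    loss_conf = lam_obj * bce(obj_mask) + lam_noobj * bce(noobj_mask)

    # Loss_cls: binary cross entropy over the per-class sigmoid outputs.
    loss_cls = F.binary_cross_entropy(
        torch.sigmoid(pred[obj_mask, 5:]), truth[obj_mask, 5:], reduction="sum")

    return l1 * loss_loc + l2 * loss_conf + l3 * loss_cls
```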
(3) training is carried out with the designed loss function using stochastic gradient descent (SGD), and Adam is adopted as the gradient update method.
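Since the text names both SGD training and the Adam update rule, one plausible reading is mini-batch gradient training with Adam as the optimizer; a hedged sketch, where model, loader, the learning rate and the target preparation are all assumptions and total_loss refers to the sketch above:

```python
import torch

# Assumptions: `model` is the improved YOLOv3 network with STL/de-STL layers,
# and `loader` yields (images, (truth, obj_mask, noobj_mask)) mini-batches
# built from the Detection2500 training split.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(100):
    for images, targets in loader:
        optimizer.zero_grad()
        loss = total_loss(model(images), *targets)
        loss.backward()
        optimizer.step()
```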
Compared with other detection methods, the model provided by the invention achieves higher detection precision and performance on the target detection task in the outdoor mountain forest environment, as well as higher accuracy for small-target detection; the specific detection effect is shown in FIG. 5.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An outdoor mountain forest environment target detection method based on an improved YOLOv3 network, characterized by comprising the following steps:
step one, acquiring background pictures and foreground objects of the outdoor mountain forest, and presetting the detection target categories of people and vehicles;
step two, superimposing the background and the foreground through image preprocessing to generate a data set, acquiring the bounding box and category data of each foreground object, and generating xml files in the same format as the PASCAL VOC 2012 data set, thereby obtaining a training set, a verification set and a test set;
step three, based on the YOLOv3 network model, adding a spatial transformation layer (STL) behind the feature extraction layers of different scales, training with the feature maps of different scales as input to obtain different affine transformation results, and then performing the subsequent classification and bounding-box regression on the output features of the STL layers;
step four, adding a de-STL layer at the end of the network so that the final computed result matches the encoding of the ground-truth bounding box, which takes the feature map before the affine transformation as its coordinate system, thereby obtaining x and y in the original image coordinate system and calculating the loss;
and step five, training the improved YOLOv3 network obtained in step four with the training set, verification set and test set obtained in step two to obtain the model with optimal performance.
2. The outdoor mountain forest environment target detection method based on an improved YOLOv3 network according to claim 1, characterized in that step one specifically comprises the following process: using python3 + Beautiful Soup + requests + lxml, images for the four keywords forest, valley, plain and wetland are crawled from the Internet, 700, 600, 600 and 600 respectively, and unsuitable images are removed manually; the data of people and cars in the COCO2014 data set are selected as foreground objects, since the objects they contain are common in daily life with frequent deformation and occlusion, and the foreground objects are randomly inserted into the background images to construct the outdoor mountain forest environment data set.
3. The outdoor mountain forest environment target detection method based on an improved YOLOv3 network according to claim 1, characterized in that step three specifically comprises the following process: the design of the attention-based STL layer centers on the localization net, which outputs 6 parameters for an input feature map and applies an affine transformation to the original feature map, thereby alleviating the drop in detection accuracy caused by target rotation and scale change; the STL is therefore embedded after the conv26, conv43 and conv52 feature maps of Darknet-53, where little information is lost in passing through the STL layer while sensitivity to rotation change is preserved, and the classification, regression and other losses are then calculated on the output result.
4. The outdoor mountain forest environment target detection method based on an improved YOLOv3 network according to claim 1, characterized in that step four specifically comprises the following process: the de-STL layer is embedded behind the convolution layer that outputs the image target position, so that the location loss can be calculated conveniently; the location loss is:

$$Loss_{loc} = \lambda_{coord}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right]$$

where $\lambda_{coord}$ is the loss coefficient, $S$ is the number of feature grid cells, $B$ is the number of predicted boxes, and $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$ and the $j$-th bounding-box prediction is responsible for it, and 0 if no target exists in grid cell $i$; $(x_i, y_i, w_i, h_i)$ are the target position and width-height values predicted by the detection network, and $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ are the target position and width-height values in the data set.
5. The outdoor mountain forest environment target detection method based on an improved YOLOv3 network according to claim 1, characterized in that step five is specifically:
(1) the YOLOv3 model has a total of 9 anchors and 3 outputs of different scales, each output using 3 anchors, so each position of an output predicts 3 boxes; for each box, the output parameters comprise the target position coordinates and width-height values, the confidence score that the box contains an object, and the probability of each object class in the box;
(2) the loss function for training the entire network is configured as follows:
$$Loss = \lambda_1 Loss_{loc} + \lambda_2 Loss_{conf} + \lambda_3 Loss_{cls}$$

where $\lambda_1$ is the target position loss coefficient, $\lambda_2$ the target confidence loss coefficient, and $\lambda_3$ the target class loss coefficient;
the target confidence loss $Loss_{conf}$ uses binary cross entropy, where $c_i \in \{0,1\}$ indicates whether a target really exists in predicted target bounding box $i$, 0 meaning absent and 1 meaning present; $\mathbb{1}_{ij}^{obj}$ is defined to be 1 if a target exists in grid cell $i$ and the $j$-th bounding-box prediction is responsible for it, and 0 if no target exists in grid cell $i$; conversely, $\mathbb{1}_{ij}^{noobj}$ is 0 when a target exists in grid cell $i$ and 1 when none exists; $\lambda_{noobj}$ is the loss coefficient for grids containing no real target, and $\lambda_{obj}$ is the loss coefficient for grids containing a real target; $\hat{c}_i$ denotes the sigmoid probability that an object is present in predicted rectangular box $i$, and $c_i$ denotes the true value. The formula of the target confidence loss $Loss_{conf}$ is:

$$Loss_{conf} = -\lambda_{obj}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[c_i\ln\hat{c}_i+(1-c_i)\ln(1-\hat{c}_i)\right]-\lambda_{noobj}\sum_{i=0}^{S}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left[c_i\ln\hat{c}_i+(1-c_i)\ln(1-\hat{c}_i)\right]$$
the target class loss $Loss_{cls}$ also uses binary cross entropy, where $o_{ij} \in \{0,1\}$ indicates whether a target of the $j$-th class really exists in predicted target bounding box $i$, 0 meaning absent and 1 meaning present, and $\hat{o}_{ij}$ denotes the sigmoid probability predicted by the network that a target of the $j$-th class is in bounding box $i$; the formula of $Loss_{cls}$ is:

$$Loss_{cls} = -\sum_{i=0}^{S}\mathbb{1}_{i}^{obj}\sum_{j \in classes}\left[o_{ij}\ln\hat{o}_{ij}+(1-o_{ij})\ln(1-\hat{o}_{ij})\right]$$
(3) training is carried out with the designed loss function using stochastic gradient descent (SGD), and Adam is adopted as the gradient update method.
CN202011639547.8A 2020-12-31 2020-12-31 Outdoor mountain forest environment target detection method based on improved YOLOv3 network Active CN112668662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011639547.8A CN112668662B (en) 2020-12-31 2020-12-31 Outdoor mountain forest environment target detection method based on improved YOLOv3 network


Publications (2)

Publication Number Publication Date
CN112668662A (en) 2021-04-16
CN112668662B (en) 2022-12-06

Family

ID=75413678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011639547.8A Active CN112668662B (en) 2020-12-31 2020-12-31 Outdoor mountain forest environment target detection method based on improved YOLOv3 network

Country Status (1)

Country Link
CN (1) CN112668662B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020100371A4 (en) * 2020-03-12 2020-04-16 Jilin University Hierarchical multi-object tracking method based on saliency detection
CN111507271A (en) * 2020-04-20 2020-08-07 北京理工大学 Airborne photoelectric video target intelligent detection and identification method
CN111626128A (en) * 2020-04-27 2020-09-04 江苏大学 Improved YOLOv 3-based pedestrian detection method in orchard environment
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism
CN112101434A (en) * 2020-09-04 2020-12-18 河南大学 Infrared image weak and small target detection method based on improved YOLO v3

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020100371A4 (en) * 2020-03-12 2020-04-16 Jilin University Hierarchical multi-object tracking method based on saliency detection
CN111507271A (en) * 2020-04-20 2020-08-07 北京理工大学 Airborne photoelectric video target intelligent detection and identification method
CN111626128A (en) * 2020-04-27 2020-09-04 江苏大学 Improved YOLOv 3-based pedestrian detection method in orchard environment
CN111814621A (en) * 2020-06-29 2020-10-23 中国科学院合肥物质科学研究院 Multi-scale vehicle and pedestrian detection method and device based on attention mechanism
CN112101434A (en) * 2020-09-04 2020-12-18 河南大学 Infrared image weak and small target detection method based on improved YOLO v3

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TINGCHAO SHI, ET AL: "Fast Classification and Detection of Marine Targets in Complex Scenes with YOLOv3", 《OCEANS 2019 - MARSEILLE》 *
FAN HONGCHAO et al.: "Traffic sign detection based on Anchor-free", 《Journal of Geo-Information Science》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837087A (en) * 2021-09-24 2021-12-24 上海交通大学宁波人工智能研究院 Animal target detection system and method based on YOLOv3
CN113837087B (en) * 2021-09-24 2023-08-29 上海交通大学宁波人工智能研究院 Animal target detection system and method based on YOLOv3
CN115291730A (en) * 2022-08-11 2022-11-04 北京理工大学 Wearable bioelectric equipment and bioelectric action identification and self-calibration method
CN115291730B (en) * 2022-08-11 2023-08-15 北京理工大学 Wearable bioelectric equipment and bioelectric action recognition and self-calibration method

Also Published As

Publication number Publication date
CN112668662B (en) 2022-12-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant