CN115272242B - YOLOv5-based optical remote sensing image target detection method

YOLOv5-based optical remote sensing image target detection method

Info

Publication number
CN115272242B
Authority
CN
China
Prior art keywords
remote sensing
optical remote
image
features
detected
Prior art date
Legal status
Active
Application number
CN202210909740.1A
Other languages
Chinese (zh)
Other versions
CN115272242A (en)
Inventor
侯彪
李智德
汤奇
任仲乐
任博
杨晨
焦李成
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210909740.1A
Publication of CN115272242A
Application granted
Publication of CN115272242B
Legal status: Active


Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06V 10/764: Image or video recognition using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82: Image or video recognition using neural networks
    • G06T 2207/10032: Image acquisition modality; satellite or aerial image; remote sensing
    • G06V 2201/07: Target detection

Abstract

The invention relates to a YOLOv5-based optical remote sensing image target detection method, characterized by comprising the following steps. Step 1: acquiring an optical remote sensing image to be detected, the image containing a target to be detected. Step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected. Step 3: inputting each optical remote sensing sub-image to be detected into a pre-trained YOLOv5 target detection model to obtain a corresponding sub-image detection result, the detection result comprising a target detection frame and a classification-intersection ratio. Step 4: merging the sub-image detection results to obtain the detection result of the optical remote sensing image to be detected. The YOLOv5 target detection model comprises a cascaded backbone network, neck network and detection head, wherein the neck network is a CSP-BiFPN network. The YOLOv5-based optical remote sensing image target detection method has higher detection precision and a stronger ability to distinguish targets of different scales.

Description

YOLOv5-based optical remote sensing image target detection method
Technical Field
The invention belongs to the technical field of aircraft detection in optical remote sensing images, and particularly relates to a YOLOv5-based optical remote sensing image target detection method.
Background
Conventional target detection methods are typically designed around manually extracted features, are tailored to a particular scene, and require extensive parameter optimization, so they generalize poorly. Such methods are not suitable for optical remote sensing images, whose scenes are increasingly complex.
With the rapid development of deep learning, features extracted by convolutional neural networks generalize far better than features extracted by traditional manual methods. Current target detection models generally comprise two parts. The first part consists of the networks used for feature extraction, namely a backbone network and a neck network: the backbone network is usually pre-trained on a large-scale image dataset to obtain better generalization and a stronger feature extraction capability, while the neck network fuses feature layers of different downsampling magnifications so that targets of different sizes can be identified and located. The second part is the detection head, which performs classification and coordinate regression on the extracted features.
Target detection models may be further divided into single-stage and two-stage models according to whether candidate regions are pre-extracted. The two-stage models, represented by Faster R-CNN, improve precision but cannot match the inference speed of single-stage models, and the anchor boxes of their RPN networks must be set manually, which is a further drawback. The most representative single-stage models are YOLO, SSD and RetinaNet. Single-stage models are simpler than two-stage models, performing classification and regression directly on the extracted feature maps; they therefore offer relatively fast inference, at the cost of somewhat lower precision than two-stage models.
The neck network adopted by most current target detection models, including the YOLOv5 model, combines a feature pyramid network FPN with a path aggregation network PAN. Such neck networks simply add together input features of different downsampling magnifications without considering how much each input contributes to the final fused features, so the benefit of feature fusion is not fully realized and the model's ability to distinguish targets of different scales is reduced. In addition, the classification branch and the regression branch in the detection head of current target detection models are usually independent, with no direct connection between them; as a result, a prediction may have a high classification score but a badly deviated detection frame, or an accurate detection frame but a low classification score, lowering model precision.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a YOLOv5-based optical remote sensing image target detection method. The technical problems to be solved by the invention are realized by the following technical scheme:
The invention provides a YOLOv5-based optical remote sensing image target detection method, which comprises the following steps:
step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected;
step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected;
step 3: inputting the optical remote sensing sub-image to be detected into a pre-trained YOLOv5 target detection model to obtain a corresponding sub-image detection result, wherein the detection result comprises a target detection frame and a classification-intersection ratio;
step 4: combining the sub-image detection results to obtain a detection result of the optical remote sensing image to be detected;
the YOLOv5 target detection model comprises a backbone network, a neck network and a detection head which are cascaded, wherein the neck network is a CSP-BiFPN network.
In one embodiment of the present invention, the backbone network is configured to perform feature extraction on the input optical remote sensing sub-image to be detected, so as to obtain a high-level semantic feature, a middle-level semantic feature and a low-level semantic feature corresponding to the optical remote sensing sub-image to be detected;
the neck network is used for fusing the high-level semantic features, the middle-level semantic features and the low-level semantic features to obtain high-level semantic fusion features, middle-level semantic fusion features and low-level semantic fusion features;
the detection head is used for determining and outputting a detection result of the optical remote sensing sub-image to be detected according to the high-level semantic fusion feature, the middle-level semantic fusion feature and the low-level semantic fusion feature.
In one embodiment of the present invention, the CSP-BiFPN network comprises a plurality of CSP-BiFPN sub-networks connected in series, each CSP-BiFPN sub-network comprising a high-level CSP2-n unit, a middle-level first CSP2-n unit, a middle-level second CSP2-n unit and a low-level CSP2-n unit, wherein,
the low-level CSP2-n unit performs feature fusion on the low-level semantic features and the intermediate fusion features subjected to the up-sampling operation to obtain the low-level semantic fusion features;
the middle-layer first CSP2-n unit performs feature fusion on the high-layer semantic features subjected to the up-sampling operation and the middle-layer semantic features to obtain middle fusion features;
the middle-layer second CSP2-n unit performs feature fusion on the middle-layer semantic features, the middle fusion features and the lower-layer semantic fusion features subjected to downsampling operation to obtain the middle-layer semantic fusion features;
and the high-level CSP2-n unit performs feature fusion on the high-level semantic features and the middle-level semantic fusion features subjected to the downsampling operation to obtain the high-level semantic fusion features.
In one embodiment of the present invention, the detection head includes a regression branch that outputs the target detection frame of the optical remote sensing sub-image to be detected and a classification branch that outputs the classification-intersection ratio of the optical remote sensing sub-image to be detected.
In one embodiment of the present invention, the YOLOv5 target detection model is obtained by training on a plurality of training image samples and the label corresponding to each training image sample, wherein the labels comprise target coordinate labels and target classification labels.
In one embodiment of the invention, the target classification label is a continuous number between 0 and 1.
In one embodiment of the present invention, the classification loss function of the YOLOv5 object detection model is:
Loss = -|y - σ|^β ((1 - y)·log(1 - σ) + y·log σ);
wherein y represents the classification-intersection ratio, σ represents the output of the classification branch of the detection head, and β represents a modulating factor;
the calculation formula of the classification-intersection ratio of the training image sample is as follows: y=a×l;
wherein a represents the intersection-over-union IoU between the predicted coordinates of the training image sample and the corresponding target coordinate label, the predicted coordinates being obtained by decoding the output of the regression branch of the detection head, and l represents the target classification label of the training image sample.
In one embodiment of the present invention, the step 4 includes:
step 4.1: filtering and de-duplicating the sub-image detection results;
step 4.2: and merging the sub-image detection results after the filtering and de-duplication treatment to obtain the detection result of the optical remote sensing image to be detected.
Compared with the prior art, the invention has the beneficial effects that:
1. In the YOLOv5-based optical remote sensing image target detection method, target detection is completed using a trained YOLOv5 target detection model whose neck network is a CSP-BiFPN network, combining the bidirectional feature pyramid network BiFPN with cross-stage-partial CSP convolution. The BiFPN introduces learnable parameters to automatically learn the importance of features at different downsampling magnifications, makes full use of the information among features of different downsampling magnifications, and improves the distinguishability of the features at each level, so targets of different sizes in the optical remote sensing image can be better distinguished. The CSP convolution adopts a bottleneck connection: the input features are split along the channel dimension into two parts, features are extracted from each part by different convolution operations, the two parts are concatenated along the channel dimension, and a final convolution layer extracts features once more. This greatly strengthens the feature extraction capability, generates more discriminative semantic fusion features, and provides strong support for the model to further distinguish targets of different scales.
2. According to the method for detecting the optical remote sensing image target based on the YOLOv5, the detection head classification branch of the YOLOv5 target detection model is converted from direct prediction type information into prediction classification-cross ratio, the relevance of classification branch output and regression branch output is further enhanced, and the new loss function guide model is used for combining the classification information and coordinate regression information, so that the detection precision of the model is improved.
The foregoing is merely an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the content of the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic diagram of an optical remote sensing image target detection method based on YOLOv5 according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for detecting an optical remote sensing image target based on YOLOv5 according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a CSP-BiFPN subnetwork according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a CSP2-n unit according to an embodiment of the present invention.
Detailed Description
In order to further explain the technical means and effects adopted by the invention to achieve the preset aim, the following describes in detail an optical remote sensing image target detection method based on YOLOv5 according to the invention with reference to the attached drawings and the detailed description.
The foregoing and other features, aspects, and advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments when taken in conjunction with the accompanying drawings. The technical means and effects adopted by the present invention to achieve the intended purpose can be more deeply and specifically understood through the description of the specific embodiments, however, the attached drawings are provided for reference and description only, and are not intended to limit the technical scheme of the present invention.
Example 1
Referring to FIG. 1 and FIG. 2: FIG. 1 is a schematic diagram of a YOLOv5-based optical remote sensing image target detection method according to an embodiment of the present invention, and FIG. 2 is a flowchart of the method. As shown in the figures, the YOLOv5-based optical remote sensing image target detection method of this embodiment includes:
step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected;
step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected;
specifically, the optical remote sensing image to be detected is cut so that the cut image is adapted to the YOLOv5 target detection model, and in this embodiment, the optical remote sensing image to be detected is cut into 1024 x 1024 optical remote sensing sub-images to be detected.
Step 3: inputting an optical remote sensing sub-image to be detected into a pre-trained YOLOv5 target detection model to obtain a corresponding sub-image detection result, wherein the detection result comprises a target detection frame and a classification-intersection ratio;
in this embodiment, the YOLOv5 target detection model includes a cascaded backbone network, a neck network, and a detection head.
The backbone network is used for extracting features of the input optical remote sensing sub-image to be detected, and obtaining high-level semantic features, middle-level semantic features and low-level semantic features corresponding to the optical remote sensing sub-image to be detected.
The features output by the network layers of the backbone network are divided into low-level, middle-level and high-level semantic features according to downsampling magnification. Low-level semantic features are those output by network layers with a lower downsampling magnification, high-level semantic features are those output by network layers with a higher downsampling magnification, and middle-level semantic features are those output by network layers whose downsampling magnification lies in between; the deeper the network, the more semantic information its extracted features carry.
In this embodiment, the low-level, middle-level and high-level semantic features are the features output by the backbone network layers with downsampling magnifications of 8, 16 and 32, respectively.
In this embodiment, the backbone network of the YOLOv5 target detection model is the backbone network of YOLOv5 itself, and the specific structure is not described here again.
Further, the neck network is used for fusing the high-level semantic features, the middle-level semantic features and the low-level semantic features to obtain high-level semantic fusion features, middle-level semantic fusion features and low-level semantic fusion features.
In this embodiment, the neck network of the YOLOv5 target detection model is a CSP-BiFPN network. The CSP-BiFPN network comprises a plurality of CSP-BiFPN sub-networks connected in series, each comprising a high-level CSP2-n unit, a middle-level first CSP2-n unit, a middle-level second CSP2-n unit and a low-level CSP2-n unit.
In this embodiment, the CSP-BiFPN network includes three structurally identical CSP-BiFPN sub-networks connected in series to form the complete CSP-BiFPN network; the structure of a CSP-BiFPN sub-network is shown in FIG. 3.
Since the YOLOv5 backbone network outputs three levels of features, namely high-level, middle-level and low-level semantic features, the neck network of this embodiment correspondingly outputs three levels of fusion features: high-level, middle-level and low-level semantic fusion features.
That is, a CSP-BiFPN sub-network receives as input three feature maps of different downsampling magnifications and outputs three fused feature maps at those same magnifications, the downsampling magnifications corresponding one-to-one between input and output.
Specifically, for a CSP-BiFPN sub-network connected with a backbone network, a low-level CSP2-n unit thereof performs feature fusion on low-level semantic features and intermediate fusion features subjected to up-sampling operation to obtain low-level semantic fusion features; the middle-layer first CSP2-n unit performs feature fusion on the high-layer semantic features and the middle-layer semantic features subjected to the up-sampling operation to obtain middle fusion features; the middle layer second CSP2-n unit performs feature fusion on the middle layer semantic features, the middle fusion features and the lower layer semantic fusion features subjected to downsampling operation to obtain middle layer semantic fusion features; and the high-level CSP2-n unit performs feature fusion on the high-level semantic features and the middle-level semantic fusion features subjected to the downsampling operation to obtain the high-level semantic fusion features.
Taking the intermediate fusion feature as an example, its calculation formula is:

F_inter = CSP2-n((w_1 · UP(F_high) + w_2 · F_mid) / (w_1 + w_2 + ε))

wherein CSP2-n represents the middle-level first CSP2-n unit, F_high represents the input high-level semantic features, F_mid represents the input middle-level semantic features, ε = 0.0001, UP represents a bilinear-interpolation upsampling operation, w_1 represents the contribution weight of the high-level semantic features to the intermediate fusion features, and w_2 represents the contribution weight of the middle-level semantic features to the intermediate fusion features.
In this embodiment, w_1 and w_2 are learnable parameters in the training process of the YOLOv5 target detection model, with initial values of 1.
Similarly, contribution weights must be set for the low-level semantic features and the intermediate fusion features toward the low-level semantic fusion features; for the middle-level semantic features, the intermediate fusion features and the low-level semantic fusion features toward the middle-level semantic fusion features; and for the high-level semantic features and the middle-level semantic fusion features toward the high-level semantic fusion features. These 7 weights are likewise learnable parameters in the training process of the YOLOv5 target detection model, with initial values of 1.
It should be noted that the other CSP-BiFPN sub-networks in the CSP-BiFPN network fuse their input features in the same manner, which is not repeated here; their corresponding weights are also set as learnable parameters with initial values of 1. A sketch of this weighted fusion follows.
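The following PyTorch sketch shows one way the two-input case of this learnable weighted fusion could be realized; the module name is illustrative, and the optional ReLU on the weights is an assumption rather than something the patent specifies:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightedFusion2(nn.Module):
        """Fuses an upsampled high-level feature with a middle-level feature
        using learnable contribution weights w1 and w2 (initial value 1), then
        refines the result with a CSP2-n unit."""

        def __init__(self, csp2n, eps=1e-4):
            super().__init__()
            self.w = nn.Parameter(torch.ones(2))  # [w1, w2], initialized to 1
            self.eps = eps                        # epsilon = 0.0001
            self.csp2n = csp2n

        def forward(self, f_high, f_mid):
            # UP: bilinear-interpolation upsampling to the middle-level size.
            up = F.interpolate(f_high, size=f_mid.shape[-2:], mode="bilinear",
                               align_corners=False)
            w = self.w  # F.relu(self.w) would keep the weights non-negative
            fused = (w[0] * up + w[1] * f_mid) / (w[0] + w[1] + self.eps)
            return self.csp2n(fused)

The three-input fusion performed by the middle-level second CSP2-n unit follows the same pattern with three weights.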
Specifically, as shown in FIG. 4, the CSP2-n unit adopts a bottleneck connection and comprises a first branch and a second branch. The input features are split along the channel dimension into two parts, which are fed to the first and second branches respectively. The features entering the first branch pass through one CBH convolution layer, then through 3 cascaded CBH convolution layers, and are finally output through a two-dimensional convolution layer (Conv2D); the features entering the second branch are output through a single two-dimensional convolution layer. The outputs of the two branches are concatenated in the channel dimension and then passed through a BN layer (batch normalization), an L-ReLU layer and a CBH convolution layer to extract features once more.
In this embodiment, a CBH convolution layer consists of a cascaded two-dimensional convolution layer, BN layer and H-swish layer. The last CBH convolution layer in the CSP2-n unit has 256 convolution kernels, the remaining CBH convolution layers have 128, and all convolution kernels are 3×3. L-ReLU and H-swish denote activation functions.
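Under the structure just described, a CSP2-n unit might be sketched in PyTorch as below. The 128/256 kernel counts and the 3×3 kernels follow this embodiment, while the even channel split and the LeakyReLU slope standing in for the L-ReLU layer are assumptions:

    import torch
    import torch.nn as nn

    class CBH(nn.Module):
        """CBH convolution layer: Conv2D + BN + H-swish, 3x3 kernels."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                nn.BatchNorm2d(c_out),
                nn.Hardswish(),
            )

        def forward(self, x):
            return self.block(x)

    class CSP2n(nn.Module):
        """Bottleneck-style CSP2-n unit: two branches over a channel split,
        concatenated along channels and refined once more."""
        def __init__(self, c_in, c_mid=128, c_out=256):
            super().__init__()
            c1 = c_in // 2  # channels routed to the first branch
            self.branch1 = nn.Sequential(
                CBH(c1, c_mid),                                           # one CBH layer
                CBH(c_mid, c_mid), CBH(c_mid, c_mid), CBH(c_mid, c_mid),  # 3 cascaded CBH layers
                nn.Conv2d(c_mid, c_mid, 3, padding=1, bias=False),        # Conv2D
            )
            self.branch2 = nn.Conv2d(c_in - c1, c_mid, 3, padding=1, bias=False)
            self.post = nn.Sequential(
                nn.BatchNorm2d(2 * c_mid),
                nn.LeakyReLU(0.1),      # stands in for the L-ReLU layer
                CBH(2 * c_mid, c_out),  # last CBH layer: 256 kernels
            )

        def forward(self, x):
            c1 = x.shape[1] // 2
            y = torch.cat([self.branch1(x[:, :c1]), self.branch2(x[:, c1:])], dim=1)
            return self.post(y)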
The neck network of the YOLOv5 target detection model of this embodiment is thus a CSP-BiFPN network, combining the bidirectional feature pyramid network BiFPN with cross-stage-partial CSP convolution. The BiFPN introduces learnable parameters to automatically learn the importance of features at different downsampling magnifications, makes full use of the information among features of different downsampling magnifications, and improves the distinguishability of the features at each level, so targets of different sizes in the optical remote sensing image can be better distinguished. The bottleneck connection of the CSP convolution greatly strengthens the feature extraction capability, generates more discriminative semantic fusion features, and provides strong support for the model to further distinguish targets of different scales.
Further, the detection head is used for determining and outputting a detection result of the optical remote sensing sub-image to be detected according to the high-level semantic fusion feature, the middle-level semantic fusion feature and the low-level semantic fusion feature.
In this embodiment, the detection head includes a regression branch and a classification branch: the regression branch outputs the target detection frame of the optical remote sensing sub-image to be detected, and the classification branch outputs its classification-intersection ratio. The classification-intersection ratio represents the joint distribution between the predicted class of a target and the IoU between its predicted and true coordinates.
For clarity, the training procedure of the YOLOv5 target detection model is described as follows. First, a training data set is acquired, comprising a number of training image samples together with the target coordinate label and target classification label of each sample. Note that, unlike the original YOLOv5, whose target classification labels are the discrete values 0 and 1, the target classification label of a training image sample in this embodiment is a continuous number between 0 and 1. The training data set is input into the YOLOv5 target detection model described above, the loss value during training is calculated with a loss function, and the model parameters, comprising the YOLOv5 network parameters and the contribution weights of the various fusion features, are optimized with a stochastic gradient descent (SGD) optimizer. When the loss value calculated after a batch of training image samples has been input falls below a preset threshold, the YOLOv5 target detection model is considered converged and training is complete.
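A highly simplified sketch of that procedure follows. Only the SGD optimizer and the convergence test on the loss value come from the text; the hyperparameter values are placeholders, and model, train_loader and compute_loss stand for the YOLOv5 target detection model, the training data set and the loss described next:

    import torch

    def train(model, train_loader, compute_loss, max_epochs=300, loss_threshold=0.05):
        """Optimize the model with SGD until the loss falls below a preset
        threshold (hyperparameter values here are illustrative only)."""
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        loss = torch.tensor(float("inf"))
        for epoch in range(max_epochs):
            for images, targets in train_loader:
                optimizer.zero_grad()
                loss = compute_loss(model(images), targets)
                loss.backward()    # backpropagate the loss
                optimizer.step()   # update network parameters and fusion weights
            if loss.item() < loss_threshold:
                break              # model considered converged
        return model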
In this embodiment, the classification loss function of the YOLOv5 target detection model is a quality focal loss function:
Loss = -|y - σ|^β ((1 - y)·log(1 - σ) + y·log σ)    (2)
where y represents the classification-intersection ratio, σ represents the output of the classification branch of the detection head, and β represents a modulating factor, typically set to 2.
The classification-intersection ratio of a training image sample is calculated as: y = a × l;
wherein a represents the intersection-over-union IoU between the predicted coordinates of the training image sample and the corresponding target coordinate label, the predicted coordinates being obtained by decoding the output of the regression branch of the detection head, and l represents the target classification label of the training image sample.
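A direct PyTorch transcription of this quality focal loss might read as below, assuming sigma is the sigmoid output of the classification branch and y the classification-intersection ratio target; the clamping for numerical stability is our addition:

    import torch

    def quality_focal_loss(sigma, y, beta=2.0):
        """Loss = -|y - sigma|^beta * ((1 - y) * log(1 - sigma) + y * log(sigma)),
        with y = IoU(decoded predicted box, coordinate label) * classification label."""
        sigma = sigma.clamp(1e-6, 1 - 1e-6)       # keep the logarithms finite
        modulating = (y - sigma).abs().pow(beta)  # |y - sigma|^beta
        bce = (1 - y) * torch.log(1 - sigma) + y * torch.log(sigma)
        return -(modulating * bce).mean()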
In this embodiment, the classification branch of the YOLOv5 detection head is thus converted from directly predicting class information to predicting the classification-intersection ratio, which further strengthens the coupling between the classification-branch and regression-branch outputs, and the original binary cross-entropy loss function is replaced by the quality focal loss function. The quality focal loss function is designed so that the model better learns to predict the classification-intersection ratio and is guided to combine classification information with coordinate-regression information, thereby improving the precision of the model.
Step 4: merging the sub-image detection results to obtain a detection result of the optical remote sensing image to be detected;
specifically, step 4 includes:
step 4.1: filtering and de-duplicating the sub-image detection results;
in this embodiment, for the sub-image detection result, the target with smaller prediction probability is filtered, the target with larger probability is retained, and the overlapped prediction target is removed by a non-maximum suppression algorithm.
Step 4.2: and merging the sub-image detection results after the filtering and de-duplication treatment to obtain the detection result of the optical remote sensing image to be detected.
According to the YOLOv5-based optical remote sensing image target detection method of this embodiment, target detection is completed with a trained YOLOv5 target detection model whose neck network is a CSP-BiFPN network. On the one hand, the CSP-BiFPN network introduces learnable parameters to automatically learn the importance of features at different downsampling magnifications, making full use of the information among them, improving the distinguishability of the features at each level, and thereby better distinguishing targets of different sizes in the optical remote sensing image. On the other hand, the CSP-BiFPN network greatly strengthens the feature extraction capability, generating more discriminative semantic fusion features and providing strong support for the model to further distinguish targets of different scales.
In addition, according to the YOLOv5-based optical remote sensing image target detection method, the classification branch of the YOLOv5 detection head is converted from directly predicting class information to predicting the classification-intersection ratio, which further strengthens the coupling between the classification-branch and regression-branch outputs and improves the precision of the model.
It should be noted that in this document relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in an article or apparatus that comprises the element. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (5)

1. A YOLOv5-based optical remote sensing image target detection method, characterized by comprising the following steps:
step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected;
step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected;
step 3: inputting the optical remote sensing sub-image to be detected into a pre-trained YOLOv5 target detection model to obtain a corresponding sub-image detection result, wherein the detection result comprises a target detection frame and a classification-intersection ratio;
step 4: combining the sub-image detection results to obtain a detection result of the optical remote sensing image to be detected;
the YOLOv5 target detection model comprises a backbone network, a neck network and a detection head which are cascaded, wherein the neck network is a CSP-BiFPN network;
the backbone network is used for extracting the characteristics of the input optical remote sensing sub-image to be detected to obtain high-level semantic characteristics, middle-level semantic characteristics and low-level semantic characteristics corresponding to the optical remote sensing sub-image to be detected;
the neck network is used for fusing the high-level semantic features, the middle-level semantic features and the low-level semantic features to obtain high-level semantic fusion features, middle-level semantic fusion features and low-level semantic fusion features; the CSP-BiFPN network comprises a plurality of CSP-BiFPN subnetworks connected in series, wherein the CSP-BiFPN subnetworks comprise a high-level CSP2-n unit, a middle-level first CSP2-n unit, a middle-level second CSP2-n unit and a low-level CSP2-n unit,
the low-level CSP2-n unit performs feature fusion on the low-level semantic features and the intermediate fusion features subjected to the up-sampling operation to obtain the low-level semantic fusion features; the middle-layer first CSP2-n unit performs feature fusion on the high-layer semantic features subjected to the up-sampling operation and the middle-layer semantic features to obtain middle fusion features; the middle-layer second CSP2-n unit performs feature fusion on the middle-layer semantic features, the middle fusion features and the lower-layer semantic fusion features subjected to downsampling operation to obtain the middle-layer semantic fusion features; the high-level CSP2-n unit performs feature fusion on the high-level semantic features and the middle-level semantic fusion features subjected to downsampling operation to obtain the high-level semantic fusion features;
the detection head is used for determining and outputting a detection result of the optical remote sensing sub-image to be detected according to the high-level semantic fusion feature, the middle-level semantic fusion feature and the low-level semantic fusion feature; the detection head comprises a regression branch and a classification branch, the regression branch outputs a target detection frame of the optical remote sensing sub-image to be detected, and the classification branch outputs a classification-intersection ratio of the optical remote sensing sub-image to be detected.
2. The YOLOv5-based optical remote sensing image target detection method of claim 1, wherein the YOLOv5 target detection model is obtained by training on a plurality of training image samples and the label corresponding to each training image sample, the labels comprising target coordinate labels and target classification labels.
3. The YOLOv5-based optical remote sensing image target detection method of claim 2, wherein the target classification label is a continuous number between 0 and 1.
4. The YOLOv5-based optical remote sensing image target detection method of claim 2, wherein the classification loss function of the YOLOv5 target detection model is:
Loss = -|y - σ|^β ((1 - y)·log(1 - σ) + y·log σ);
wherein y represents the classification-intersection ratio, σ represents the output of the classification branch of the detection head, and β represents a modulating factor;
the calculation formula of the classification-intersection ratio of the training image sample is as follows: y=a×l;
wherein a represents an intersection ratio IoU of the predicted coordinates of the training image sample and the corresponding target coordinate label, the predicted coordinates are obtained by decoding the output of the regression branch of the detection head, and l represents the target classification label of the training image sample.
5. The YOLOv5-based optical remote sensing image target detection method according to claim 1, wherein the step 4 comprises:
step 4.1: filtering and de-duplicating the sub-image detection results;
step 4.2: and merging the sub-image detection results after the filtering and de-duplication treatment to obtain the detection result of the optical remote sensing image to be detected.
CN202210909740.1A 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method Active CN115272242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210909740.1A CN115272242B (en) 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210909740.1A CN115272242B (en) 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method

Publications (2)

Publication Number Publication Date
CN115272242A CN115272242A (en) 2022-11-01
CN115272242B 2024-02-27

Family

ID=83746239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210909740.1A Active CN115272242B (en) 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method

Country Status (1)

Country Link
CN (1) CN115272242B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385903B (en) * 2023-05-29 2023-09-19 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Anti-distortion on-orbit target detection method and model for 1-level remote sensing data

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633661A (en) * 2019-08-31 2019-12-31 南京理工大学 Semantic segmentation fused remote sensing image target detection method
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
WO2020232905A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Superobject information-based remote sensing image target extraction method, device, electronic apparatus, and medium
CN112364719A (en) * 2020-10-23 2021-02-12 西安科锐盛创新科技有限公司 Method for rapidly detecting remote sensing image target
CN112507777A (en) * 2020-10-10 2021-03-16 厦门大学 Optical remote sensing image ship detection and segmentation method based on deep learning
CN112668390A (en) * 2020-11-17 2021-04-16 福建省星云大数据应用服务有限公司 High-efficiency single remote sensing image target detection method and system
CN113569194A (en) * 2021-06-10 2021-10-29 中国人民解放军海军工程大学 Rotating rectangular box representation and regression method for target detection
CN114359565A (en) * 2021-12-14 2022-04-15 阿里巴巴(中国)有限公司 Image detection method, storage medium and computer terminal


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Improved YOLOv5 with BiFPN on PCB Defect Detection; Xiaoqi Wang et al.; 2021 2nd International Conference on Artificial Intelligence and Computer Engineering (ICAICE); Abstract, Sections I-IV *
Lightweight Remote Sensing Image Target Detection Model Based on YOLOX-Tiny; Lang Lei et al.; Laser & Optoelectronics Progress; 1-18 *
Traffic Police Gesture Recognition Based on an Improved YOLOv5 Algorithm; Wang Xin et al.; Electronic Measurement Technology; Abstract, Sections 0-4 *
Ship Classification and Detection Method for Optical Remote Sensing Images Based on Improved YOLOv5s; Zhou Qikai et al.; Laser & Optoelectronics Progress; Abstract, Sections 1-5 *
Small Target Detection Network Based on Adaptive Feature Enhancement; Wu Mengmeng et al.; Laser & Optoelectronics Progress; 1-14 *

Also Published As

Publication number Publication date
CN115272242A (en) 2022-11-01

Similar Documents

Publication Publication Date Title
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
EP3964998A1 (en) Text processing method and model training method and apparatus
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN110796199B (en) Image processing method and device and electronic medical equipment
CN111651474B (en) Method and system for converting natural language into structured query language
CN111931859B (en) Multi-label image recognition method and device
CN107992937B (en) Unstructured data judgment method and device based on deep learning
CN112381763A (en) Surface defect detection method
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN111612789A (en) Defect detection method based on improved U-net network
CN109766918B (en) Salient object detection method based on multilevel context information fusion
CN115272242B (en) YOLOv 5-based optical remote sensing image target detection method
CN112381837A (en) Image processing method and electronic equipment
CN110533068B (en) Image object identification method based on classification convolutional neural network
CN115393606A (en) Method and system for image recognition
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN112085164A (en) Area recommendation network extraction method based on anchor-frame-free network
CN111340124A (en) Method and device for identifying entity category in image
CN115830342A (en) Method and device for determining detection frame, storage medium and electronic device
US9378466B2 (en) Data reduction in nearest neighbor classification
CN115457385A (en) Building change detection method based on lightweight network
CN111126513B (en) Universal object real-time learning and recognition system and learning and recognition method thereof
CN113032612A (en) Construction method of multi-target image retrieval model, retrieval method and device
Yang et al. Road Damage Detection and Classification Based on Multi-Scale Contextual Features
CN113963150B (en) Pedestrian re-identification method based on multi-scale twin cascade network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant