CN115272242A - YOLOv5-based optical remote sensing image target detection method - Google Patents

YOLOv5-based optical remote sensing image target detection method

Info

Publication number
CN115272242A
Authority
CN
China
Prior art keywords: remote sensing, optical remote, features, detected, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210909740.1A
Other languages
Chinese (zh)
Other versions
CN115272242B (en)
Inventor
侯彪
李智德
汤奇
任仲乐
任博
杨晨
焦李成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210909740.1A
Publication of CN115272242A
Application granted
Publication of CN115272242B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10032: Satellite or aerial image; Remote sensing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Abstract

The invention relates to a YOLOv5-based optical remote sensing image target detection method, characterized by comprising the following steps. Step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected. Step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected. Step 3: inputting the optical remote sensing sub-images to be detected into a pre-trained YOLOv5 target detection model to obtain corresponding sub-image detection results, wherein each detection result comprises a target detection frame and a classification-IoU score. Step 4: merging the sub-image detection results to obtain the detection result of the optical remote sensing image to be detected. The YOLOv5 target detection model comprises a backbone network, a neck network and a detection head connected in cascade, wherein the neck network is a CSP-BiFPN network. The YOLOv5-based optical remote sensing image target detection method achieves higher detection precision and a stronger ability to distinguish targets of different scales.

Description

YOLOv5-based optical remote sensing image target detection method
Technical Field
The invention belongs to the technical field of airplane detection in optical remote sensing images, and particularly relates to a YOLOv5-based optical remote sensing image target detection method.
Background
Traditional target detection methods are usually designed around manually extracted features, are tailored to specific scenes, and require extensive parameter tuning, so they generalize poorly. Faced with optical remote sensing images of increasingly complex scenes, traditional methods are no longer applicable.
With the rapid development of deep learning, features extracted by convolutional neural networks generalize far better than features extracted by traditional manual methods. Current object detection models typically comprise two parts. The first part extracts features and consists of a backbone network and a neck network: the backbone network is usually pre-trained on a large-scale image dataset to obtain better generalization and a stronger feature extraction capability, while the neck network fuses feature layers of different down-sampling magnifications so that targets of different sizes can be recognized and localized. The second part is the detection head, which performs classification and coordinate regression on the extracted features.
Target detection models can be divided into single-stage and two-stage models according to whether candidate regions are pre-extracted. Two-stage models, represented by Faster R-CNN, achieve higher precision but infer more slowly than single-stage models, and their RPN requires anchor boxes to be set manually. The most representative single-stage models are YOLO, SSD and RetinaNet. Compared with two-stage models, single-stage models are simpler, performing classification and regression directly on the extracted feature maps; their advantage is relatively fast inference, at the cost of slightly lower precision than two-stage models.
The neck networks adopted by most current target detection models, including YOLOv5, are the feature pyramid network (FPN) and the path aggregation network (PAN). These neck networks simply add input features of different down-sampling magnifications without considering each input's contribution to the final fused feature, so the advantage of feature fusion is not fully exploited and the model's ability to distinguish targets of different scales is reduced. In addition, the classification branch and the regression branch in current detection heads are usually independent, with no direct connection between them; as a result, a prediction may have a high classification score but a badly deviated detection frame, or an accurate detection frame but a low classification score, which lowers model precision.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a YOLOv5-based optical remote sensing image target detection method. The technical problem to be solved by the invention is addressed by the following technical scheme:
the invention provides an optical remote sensing image target detection method based on YOLOv5, which comprises the following steps:
Step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected;
Step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected;
Step 3: inputting the optical remote sensing sub-images to be detected into a pre-trained YOLOv5 target detection model to obtain corresponding sub-image detection results, wherein each detection result comprises a target detection frame and a classification-IoU score;
Step 4: merging the sub-image detection results to obtain the detection result of the optical remote sensing image to be detected;
the YOLOv5 target detection model comprises a backbone network, a neck network and a detection head which are connected in a cascade mode, wherein the neck network is a CSP-BiFPN network.
In an embodiment of the invention, the backbone network is configured to perform feature extraction on the input optical remote sensing sub-image to be detected to obtain high-level semantic features, middle-level semantic features and low-level semantic features corresponding to the optical remote sensing sub-image to be detected;
the neck network is used for fusing the high-level semantic features, the middle-level semantic features and the low-level semantic features to obtain high-level semantic fusion features, middle-level semantic fusion features and low-level semantic fusion features;
the detection head is used for determining and outputting the detection result of the optical remote sensing sub-image to be detected according to the high-level, middle-level and low-level semantic fusion features.
In one embodiment of the present invention, the CSP-BiFPN network includes a plurality of CSP-BiFPN sub-networks connected in series, each CSP-BiFPN sub-network including a high-level CSP2-n unit, a middle-level first CSP2-n unit, a middle-level second CSP2-n unit, and a low-level CSP2-n unit, wherein,
the low-level CSP2-n unit performs feature fusion on the low-level semantic features and the up-sampled intermediate fusion features to obtain the low-level semantic fusion features;
the middle-level first CSP2-n unit performs feature fusion on the up-sampled high-level semantic features and the middle-level semantic features to obtain the intermediate fusion features;
the middle-level second CSP2-n unit performs feature fusion on the middle-level semantic features, the intermediate fusion features and the down-sampled low-level semantic fusion features to obtain the middle-level semantic fusion features;
and the high-level CSP2-n unit performs feature fusion on the high-level semantic features and the down-sampled middle-level semantic fusion features to obtain the high-level semantic fusion features.
In one embodiment of the invention, the detection head comprises a regression branch and a classification branch: the regression branch outputs the target detection frame of the optical remote sensing sub-image to be detected, and the classification branch outputs the classification-IoU score of the optical remote sensing sub-image to be detected.
In one embodiment of the present invention, the YOLOv5 target detection model is trained based on a plurality of training image samples and a label corresponding to each training image sample, where the label includes a target coordinate label and a target classification label.
In one embodiment of the invention, the target classification labels are continuous values between 0 and 1.
In one embodiment of the present invention, the classification loss function of the YOLOv5 target detection model is:
Loss = -|y - σ|^β · ((1 - y)·log(1 - σ) + y·log σ);
wherein y represents the classification-IoU score, σ represents the output of the classification branch of the detection head, and β represents the modulating factor;
the classification-IoU score of a training image sample is calculated as: y = A × l;
wherein A represents the intersection-over-union (IoU) between the predicted coordinates of the training image sample and the corresponding target coordinate label, the predicted coordinates being obtained by decoding the output of the regression branch of the detection head, and l represents the target classification label of the training image sample.
In one embodiment of the present invention, step 4 comprises:
Step 4.1: filtering and de-duplicating the sub-image detection results;
Step 4.2: merging the filtered and de-duplicated sub-image detection results to obtain the detection result of the optical remote sensing image to be detected.
Compared with the prior art, the invention has the beneficial effects that:
1. The YOLOv5-based optical remote sensing image target detection method completes target detection on optical remote sensing images with a trained YOLOv5 target detection model whose neck network is a CSP-BiFPN network, combining the bidirectional feature pyramid network (BiFPN) with cross-stage partial (CSP) convolution. The BiFPN introduces learnable parameters that automatically learn the importance of features at different down-sampling magnifications, makes full use of the information among these features, improves the discriminability of the features at each level, and can better distinguish targets of different sizes in optical remote sensing images. The CSP convolution adopts a bottleneck connection: it divides the input features into two parts along the channel dimension, extracts features from each part through different convolution operations, concatenates the two results along the channel dimension, and finally applies a further convolution layer to extract features again. This greatly enhances the feature extraction capability and produces more discriminative semantic fusion features, providing strong support for the model to further distinguish targets of different scales.
2. In the YOLOv5-based optical remote sensing image target detection method, the classification branch of the detection head of the YOLOv5 target detection model is converted from directly predicting class information to predicting the classification-IoU score, which strengthens the relation between the outputs of the classification branch and the regression branch; a new loss function guides the model to combine classification information with coordinate regression information, thereby improving the detection precision of the model.
The foregoing description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the description, and in order that the above and other objects, features and advantages of the present invention may be more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of an optical remote sensing image target detection method based on YOLOv5 according to an embodiment of the present invention;
Fig. 2 is a flowchart of the YOLOv5-based optical remote sensing image target detection method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a CSP-BiFPN sub-network provided in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a CSP2-n unit according to an embodiment of the present invention.
Detailed Description
In order to further explain the technical means and effects adopted by the present invention to achieve its intended object, the YOLOv5-based optical remote sensing image target detection method of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The foregoing and other technical matters, features and effects of the present invention will be apparent from the following detailed description of the embodiments, to be read in connection with the accompanying drawings. The specific embodiments allow a deeper and more concrete understanding of the technical means and effects adopted by the present invention to achieve its predetermined purpose; however, the accompanying drawings are provided for reference and illustration only and are not intended to limit the technical scheme of the present invention.
Embodiment 1
Referring to fig. 1 and fig. 2: fig. 1 is a schematic diagram of the YOLOv5-based optical remote sensing image target detection method according to an embodiment of the present invention, and fig. 2 is its flowchart. As shown in the figures, the YOLOv5-based optical remote sensing image target detection method of this embodiment includes:
Step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected;
Step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected;
Specifically, the optical remote sensing image to be detected is cut so that the cut images fit the input of the YOLOv5 target detection model. In this embodiment, the image is cut into optical remote sensing sub-images of size 1024 × 1024.
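To make the cropping in step 2 concrete, the following is a minimal Python/NumPy sketch of such a tiling routine. The helper name, the default overlap of 0, and the zero-padding of edge tiles are illustrative assumptions; the patent itself only specifies the 1024 × 1024 sub-image size.

    import numpy as np

    def crop_to_subimages(image: np.ndarray, tile: int = 1024, overlap: int = 0):
        """Yield (x0, y0, sub_image) tiles that together cover the whole image."""
        h, w = image.shape[:2]
        step = tile - overlap
        for y0 in range(0, max(h - overlap, 1), step):
            for x0 in range(0, max(w - overlap, 1), step):
                sub = image[y0:y0 + tile, x0:x0 + tile]
                # Zero-pad edge tiles so every detector input is exactly tile x tile.
                if sub.shape[0] < tile or sub.shape[1] < tile:
                    padded = np.zeros((tile, tile) + image.shape[2:], dtype=image.dtype)
                    padded[:sub.shape[0], :sub.shape[1]] = sub
                    sub = padded
                yield x0, y0, sub

The (x0, y0) offsets are kept so that sub-image detection frames can be mapped back to full-image coordinates when the results are merged in step 4.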
Step 3: inputting the optical remote sensing sub-images to be detected into the pre-trained YOLOv5 target detection model to obtain the corresponding sub-image detection results, wherein each detection result comprises a target detection frame and a classification-IoU score;
in this embodiment, the YOLOv5 target detection model includes a backbone network, a neck network, and a detection head in cascade.
The backbone network is used for performing feature extraction on the input optical remote sensing sub-images to be detected to obtain the high-level, middle-level and low-level semantic features corresponding to each sub-image.
The features output by the network layers of the backbone are divided into low-level, middle-level and high-level semantic features according to their down-sampling magnification: low-level semantic features are output by network layers with a low down-sampling magnification, high-level semantic features by layers with a high down-sampling magnification, and middle-level semantic features by layers whose down-sampling magnification lies in between. The deeper the network layer, the more semantic information it extracts.
In this embodiment, the low-level, middle-level and high-level semantic features are the outputs of the backbone network layers with down-sampling magnifications of 8, 16 and 32, respectively.
In this embodiment, the backbone network of the YOLOv5 target detection model is the backbone network of YOLOv5 itself, and its specific structure is not repeated here.
Further, the neck network is used for fusing the high-level semantic features, the middle-level semantic features and the low-level semantic features to obtain high-level semantic fusion features, middle-level semantic fusion features and low-level semantic fusion features.
In this embodiment, the neck network of the YOLOv5 target detection model is a CSP-BiFPN network. The CSP-BiFPN network comprises a plurality of CSP-BiFPN sub-networks connected in series, and each CSP-BiFPN sub-network comprises a high-level CSP2-n unit, a middle-level first CSP2-n unit, a middle-level second CSP2-n unit and a low-level CSP2-n unit.
In this embodiment, the CSP-BiFPN network comprises three CSP-BiFPN sub-networks with the same structure connected in series; the structure of a CSP-BiFPN sub-network is shown in fig. 3.
Because the backbone network of YOLOv5 outputs three levels of features, namely the high-level, middle-level and low-level semantic features, the neck network of this embodiment is designed to output the corresponding three levels of fusion features: the high-level, middle-level and low-level semantic fusion features.
That is, each CSP-BiFPN sub-network receives three feature maps of different down-sampling magnifications as input and outputs three fused feature maps, with a one-to-one correspondence of down-sampling magnifications between input and output.
Specifically, for the CSP-BiFPN sub-network connected to the backbone network: the low-level CSP2-n unit fuses the low-level semantic features with the up-sampled intermediate fusion features to obtain the low-level semantic fusion features; the middle-level first CSP2-n unit fuses the up-sampled high-level semantic features with the middle-level semantic features to obtain the intermediate fusion features; the middle-level second CSP2-n unit fuses the middle-level semantic features, the intermediate fusion features and the down-sampled low-level semantic fusion features to obtain the middle-level semantic fusion features; and the high-level CSP2-n unit fuses the high-level semantic features with the down-sampled middle-level semantic fusion features to obtain the high-level semantic fusion features.
Taking the intermediate fusion feature as an example, its calculation formula is:
F_int = CSP2-n((w1 · UP(F_high) + w2 · F_mid) / (w1 + w2 + ε))    (1)

wherein F_int denotes the intermediate fusion feature, CSP2-n denotes the middle-level first CSP2-n unit, F_high denotes the input high-level semantic features, F_mid denotes the input middle-level semantic features, ε = 0.0001, UP denotes the bilinear-interpolation up-sampling operation, w1 denotes the contribution weight of the high-level semantic features to the intermediate fusion feature, and w2 denotes the contribution weight of the middle-level semantic features to the intermediate fusion feature.
In this embodiment, w1 and w2 are learnable parameters in the training process of the YOLOv5 target detection model, with initial values of 1.
Similarly, contribution weights must be set for the low-level semantic features and the up-sampled intermediate fusion features with respect to the low-level semantic fusion features; for the middle-level semantic features, the intermediate fusion features and the down-sampled low-level semantic fusion features with respect to the middle-level semantic fusion features; and for the high-level semantic features and the down-sampled middle-level semantic fusion features with respect to the high-level semantic fusion features. In this embodiment, these 7 weights are likewise learnable parameters in the training process of the YOLOv5 target detection model, each with an initial value of 1.
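As an illustration of this weighted fusion, below is a minimal PyTorch sketch of a fusion node with learnable contribution weights initialized to 1. The module name is an invention of this sketch, and the ReLU used to keep the weights non-negative is an assumption carried over from the standard BiFPN formulation rather than something the patent states.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightedFusion(nn.Module):
        """Fuse same-shape feature maps with learnable, normalized weights."""
        def __init__(self, n_inputs: int, eps: float = 1e-4):
            super().__init__()
            self.w = nn.Parameter(torch.ones(n_inputs))  # initial values of 1
            self.eps = eps                               # epsilon = 0.0001

        def forward(self, feats):
            w = torch.relu(self.w)  # non-negativity, assumed as in standard BiFPN
            # (w1 * F1 + w2 * F2 + ...) / (w1 + w2 + ... + eps)
            return sum(wi * f for wi, f in zip(w, feats)) / (w.sum() + self.eps)

    # Intermediate fusion feature: fuse UP(F_high) with F_mid, then apply CSP2-n.
    fuse = WeightedFusion(2)
    f_high = torch.randn(1, 256, 16, 16)   # high-level features (down-sampling 32)
    f_mid = torch.randn(1, 256, 32, 32)    # middle-level features (down-sampling 16)
    pre_csp = fuse([F.interpolate(f_high, scale_factor=2, mode="bilinear"), f_mid])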
It should be noted that the other CSP-BiFPN sub-networks in the CSP-BiFPN network fuse their input features in a similar manner, which is not repeated here. Their corresponding weights are likewise learnable parameters in the training process of the YOLOv5 target detection model, with initial values set to 1.
Specifically, the structure of the CSP2-n unit is shown in fig. 4. The CSP2-n unit adopts a bottleneck connection and comprises a first branch and a second branch. The input feature is divided into two features along the channel dimension, which are input into the first branch and the second branch respectively. The feature entering the first branch passes through one CBH convolutional layer, then through 3 cascaded CBH convolutional layers, and is finally output by a two-dimensional convolutional layer (Conv2D); the feature entering the second branch is output after a convolution by a single two-dimensional convolutional layer. The outputs of the two branches are concatenated along the channel dimension and then passed, in sequence, through a BN layer (batch normalization), an L-ReLU layer and a CBH convolutional layer to extract features again.
In this embodiment, a CBH convolutional layer consists of a cascaded two-dimensional convolutional layer, BN layer and H-swish layer. The last CBH convolutional layer in the CSP2-n unit has 256 convolution kernels, the remaining CBH convolutional layers each have 128, and all kernels are of size 3 × 3. L-ReLU and H-swish denote activation functions.
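For reference, below is a minimal PyTorch sketch of the CSP2-n unit as described above: channel split, a CBH stack plus a plain Conv2D on the first branch, a single Conv2D on the second branch, then channel concatenation followed by BN, L-ReLU and a final 256-kernel CBH layer. The padding, bias settings and LeakyReLU slope are assumptions not fixed by the text.

    import torch
    import torch.nn as nn

    class CBH(nn.Module):
        """CBH convolutional layer: Conv2D + BN + H-swish."""
        def __init__(self, c_in: int, c_out: int, k: int = 3):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
            self.bn = nn.BatchNorm2d(c_out)
            self.act = nn.Hardswish()

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))

    class CSP2n(nn.Module):
        def __init__(self, c_in: int = 256, c_mid: int = 128, c_out: int = 256):
            super().__init__()
            half = c_in // 2
            # First branch: one CBH, 3 cascaded CBH layers, then a plain Conv2D.
            self.branch1 = nn.Sequential(
                CBH(half, c_mid),
                CBH(c_mid, c_mid), CBH(c_mid, c_mid), CBH(c_mid, c_mid),
                nn.Conv2d(c_mid, c_mid, 3, padding=1, bias=False),
            )
            # Second branch: a single plain Conv2D.
            self.branch2 = nn.Conv2d(half, c_mid, 3, padding=1, bias=False)
            # After channel concatenation: BN -> L-ReLU -> final CBH (256 kernels).
            self.bn = nn.BatchNorm2d(2 * c_mid)
            self.act = nn.LeakyReLU(0.1)
            self.final = CBH(2 * c_mid, c_out)

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=1)  # split the input features by channel
            y = torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)
            return self.final(self.act(self.bn(y)))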
The neck network of the YOLOv5 target detection model in this embodiment is a CSP-BiFPN network, which combines the bidirectional feature pyramid network (BiFPN) with cross-stage partial (CSP) convolution. The BiFPN introduces learnable parameters that automatically learn the importance of features at different down-sampling magnifications, makes full use of the information among these features, improves the discriminability of the features at each level, and can better distinguish targets of different sizes in optical remote sensing images. The CSP convolution adopts a bottleneck connection: it divides the input features into two parts along the channel dimension, extracts features from each part through different convolution operations, concatenates the two results along the channel dimension, and finally applies a further convolution layer to extract features again. This greatly enhances the feature extraction capability and produces more discriminative semantic fusion features, providing strong support for the model to further distinguish targets of different scales.
Furthermore, the detection head is used for determining and outputting the detection result of the optical remote sensing sub-image to be detected according to the high-level, middle-level and low-level semantic fusion features.
In this embodiment, the detection head comprises a regression branch and a classification branch: the regression branch outputs the target detection frame of the optical remote sensing sub-image to be detected, and the classification branch outputs the classification-IoU score of the optical remote sensing sub-image to be detected. The classification-IoU score represents the joint distribution of the predicted class of a target and the IoU between its predicted coordinates and its real coordinates.
For clarity, the training process of the YOLOv5 target detection model is described as follows. First, a training dataset is obtained, comprising a plurality of training image samples together with the target coordinate label and target classification label of each sample. Note that unlike the original YOLOv5, whose target classification labels are the discrete values 0 and 1, in this embodiment the target classification label of a training image sample is a continuous value between 0 and 1. The training dataset is input into the YOLOv5 target detection model described above, the loss value during training is computed with the loss function, and the model parameters are optimized with a stochastic gradient descent (SGD) optimizer; the model parameters comprise the network structure parameters of YOLOv5 and the contribution weights set for each fusion feature. When the loss value computed after a batch of training image samples has been input into the model falls below a preset threshold, the YOLOv5 target detection model is considered converged and training is complete.
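Schematically, that training procedure reduces to the loop below. The model, data and loss here are deliberately simple stand-ins so the snippet stays self-contained; only the SGD optimizer, the back-propagation step and the loss-threshold convergence test come from the description above.

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the YOLOv5 detection model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_threshold = 0.05                    # assumed preset convergence threshold

    for step in range(1000):
        images = torch.randn(4, 3, 64, 64)    # stand-in training batch
        targets = torch.randn(4, 16, 64, 64)  # stand-in labels
        loss = nn.functional.mse_loss(model(images), targets)  # stand-in for the QFL-based loss
        optimizer.zero_grad()
        loss.backward()      # back-propagation
        optimizer.step()     # SGD parameter update
        if loss.item() < loss_threshold:  # converged per the criterion above
            break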
In this embodiment, the classification loss function of the YOLOv5 target detection model is the Quality Focal Loss:
Loss = -|y - σ|^β · ((1 - y)·log(1 - σ) + y·log σ)    (2)
where y denotes the classification-IoU score, σ denotes the output of the classification branch of the detection head, and β denotes the modulating factor, normally set to 2.
The classification-IoU score of a training image sample is calculated as: y = A × l;
wherein A represents the intersection-over-union (IoU) between the predicted coordinates of the training image sample and the corresponding target coordinate label (the predicted coordinates are obtained by decoding the output of the regression branch of the detection head), and l represents the target classification label of the training image sample.
In this embodiment, the classification branch of the YOLOv5 detection head is converted from directly predicting class information to predicting the classification-IoU score, further strengthening the relation between the outputs of the classification branch and the regression branch, and the original binary cross-entropy loss function is replaced with the Quality Focal Loss. The Quality Focal Loss is designed so that the model better learns to predict the classification-IoU score; it guides the model to combine classification information with coordinate regression information, thereby improving model precision.
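The Quality Focal Loss and the classification-IoU target y = A × l can be sketched in PyTorch as follows. The (x1, y1, x2, y2) box format, the clamping for numerical stability, and the sample values are illustrative assumptions.

    import torch

    def quality_focal_loss(sigma: torch.Tensor, y: torch.Tensor, beta: float = 2.0):
        """Loss = -|y - sigma|^beta * ((1 - y) * log(1 - sigma) + y * log(sigma))."""
        sigma = sigma.clamp(1e-6, 1 - 1e-6)  # avoid log(0)
        ce = (1 - y) * torch.log(1 - sigma) + y * torch.log(sigma)
        return -(torch.abs(y - sigma) ** beta * ce).mean()

    def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
        """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
        lt = torch.maximum(box_a[..., :2], box_b[..., :2])
        rb = torch.minimum(box_a[..., 2:], box_b[..., 2:])
        inter = (rb - lt).clamp(min=0).prod(-1)
        area_a = (box_a[..., 2:] - box_a[..., :2]).prod(-1)
        area_b = (box_b[..., 2:] - box_b[..., :2]).prod(-1)
        return inter / (area_a + area_b - inter + 1e-6)

    # Target y = A * l: IoU of the decoded prediction vs. the coordinate label,
    # scaled by the continuous classification label l in [0, 1].
    pred_box = torch.tensor([10.0, 10.0, 50.0, 50.0])
    gt_box = torch.tensor([12.0, 8.0, 48.0, 52.0])
    l = torch.tensor(1.0)
    y = iou(pred_box, gt_box) * l
    loss = quality_focal_loss(torch.tensor(0.7), y)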
Step 4: merging the sub-image detection results to obtain the detection result of the optical remote sensing image to be detected;
Specifically, step 4 comprises:
Step 4.1: filtering and de-duplicating the sub-image detection results;
In this embodiment, targets with a low predicted probability are filtered out of the sub-image detection results, targets with a high probability are retained, and overlapping predicted targets are removed with a non-maximum suppression algorithm.
Step 4.2: merging the filtered and de-duplicated sub-image detection results to obtain the detection result of the optical remote sensing image to be detected.
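A schematic sketch of this merging step, using torchvision's non-maximum suppression, is given below. The score threshold, IoU threshold and per-sub-image result format are illustrative assumptions (the patent does not give these values), and NMS is applied class-agnostically for brevity.

    import torch
    from torchvision.ops import nms

    def merge_subimage_results(results, score_thr: float = 0.3, iou_thr: float = 0.5):
        """results: list of (x0, y0, boxes[N, 4], scores[N]) per sub-image."""
        all_boxes, all_scores = [], []
        for x0, y0, boxes, scores in results:
            keep = scores > score_thr  # drop targets with low predicted probability
            boxes, scores = boxes[keep], scores[keep]
            # Shift boxes back into full-image coordinates using the tile offset.
            boxes = boxes + torch.tensor([x0, y0, x0, y0], dtype=boxes.dtype)
            all_boxes.append(boxes)
            all_scores.append(scores)
        boxes = torch.cat(all_boxes)
        scores = torch.cat(all_scores)
        keep = nms(boxes, scores, iou_thr)  # remove overlapping duplicate predictions
        return boxes[keep], scores[keep]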
In the YOLOv5-based optical remote sensing image target detection method of this embodiment, target detection on the optical remote sensing image is completed with the trained YOLOv5 target detection model, whose neck network is a CSP-BiFPN network. On the one hand, the CSP-BiFPN network introduces learnable parameters that automatically learn the importance of features at different down-sampling magnifications, makes full use of the information among these features, improves the discriminability of the features at each level, and can better distinguish targets of different sizes in the optical remote sensing image; on the other hand, the CSP-BiFPN network greatly enhances the feature extraction capability and generates more discriminative semantic fusion features, providing strong support for the model to further distinguish targets of different scales.
In addition, in the YOLOv5-based optical remote sensing image target detection method of this embodiment, the classification branch of the YOLOv5 detection head is converted from directly predicting class information to predicting the classification-IoU score, which further strengthens the relation between the outputs of the classification branch and the regression branch and thereby improves model precision.
It should be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between these entities or actions. Furthermore, the terms "comprises", "comprising" and any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or device comprising a list of elements includes not only those elements but may also include other elements not expressly listed. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the article or device comprising that element. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections and may include electrical connections, whether direct or indirect.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be construed as limited to these descriptions. Those of ordinary skill in the art to which the invention belongs may make several simple deductions or substitutions without departing from the concept of the invention, all of which shall be regarded as falling within the protection scope of the invention.

Claims (8)

1. A YOLOv5-based optical remote sensing image target detection method, characterized by comprising the following steps:
step 1: acquiring an optical remote sensing image to be detected, wherein the optical remote sensing image to be detected contains a target to be detected;
step 2: cutting the optical remote sensing image to be detected into a plurality of optical remote sensing sub-images to be detected;
step 3: inputting the optical remote sensing sub-images to be detected into a pre-trained YOLOv5 target detection model to obtain corresponding sub-image detection results, wherein each detection result comprises a target detection frame and a classification-IoU score;
step 4: merging the sub-image detection results to obtain the detection result of the optical remote sensing image to be detected;
the YOLOv5 target detection model comprises a backbone network, a neck network and a detection head which are cascaded, wherein the neck network is a CSP-BiFPN network.
2. The YOLOv5-based optical remote sensing image target detection method according to claim 1,
wherein the backbone network is used for extracting features from the input optical remote sensing sub-images to be detected to obtain high-level semantic features, middle-level semantic features and low-level semantic features corresponding to the optical remote sensing sub-images to be detected;
the neck network is used for fusing the high-level semantic features, the middle-level semantic features and the low-level semantic features to obtain high-level semantic fusion features, middle-level semantic fusion features and low-level semantic fusion features;
the detection head is used for determining and outputting the detection result of the optical remote sensing sub-image to be detected according to the high-level, middle-level and low-level semantic fusion features.
3. The YOLOv5-based optical remote sensing image target detection method according to claim 2, wherein the CSP-BiFPN network comprises a plurality of CSP-BiFPN sub-networks connected in series, each CSP-BiFPN sub-network comprising a high-level CSP2-n unit, a middle-level first CSP2-n unit, a middle-level second CSP2-n unit and a low-level CSP2-n unit, wherein,
the low-level CSP2-n unit performs feature fusion on the low-level semantic features and the up-sampled intermediate fusion features to obtain the low-level semantic fusion features;
the middle-level first CSP2-n unit performs feature fusion on the up-sampled high-level semantic features and the middle-level semantic features to obtain the intermediate fusion features;
the middle-level second CSP2-n unit performs feature fusion on the middle-level semantic features, the intermediate fusion features and the down-sampled low-level semantic fusion features to obtain the middle-level semantic fusion features;
and the high-level CSP2-n unit performs feature fusion on the high-level semantic features and the down-sampled middle-level semantic fusion features to obtain the high-level semantic fusion features.
4. The YOLOv5-based optical remote sensing image target detection method according to claim 2, wherein the detection head comprises a regression branch and a classification branch, the regression branch outputs the target detection frame of the optical remote sensing sub-image to be detected, and the classification branch outputs the classification-IoU score of the optical remote sensing sub-image to be detected.
5. The YOLOv5-based optical remote sensing image target detection method according to claim 1, wherein the YOLOv5 target detection model is trained based on a plurality of training image samples and a label corresponding to each training image sample, the label comprising a target coordinate label and a target classification label.
6. The YOLOv5-based optical remote sensing image target detection method according to claim 5, wherein the target classification label is a continuous value between 0 and 1.
7. The YOLOv5-based optical remote sensing image target detection method according to claim 5, wherein the classification loss function of the YOLOv5 target detection model is:
Loss = -|y - σ|^β · ((1 - y)·log(1 - σ) + y·log σ);
wherein y represents the classification-IoU score, σ represents the output of the classification branch of the detection head, and β represents the modulating factor;
the classification-IoU score of a training image sample is calculated as: y = A × l;
wherein A represents the intersection-over-union (IoU) between the predicted coordinates of the training image sample and the corresponding target coordinate label, the predicted coordinates being obtained by decoding the output of the regression branch of the detection head, and l represents the target classification label of the training image sample.
8. The YOLOv5-based optical remote sensing image target detection method according to claim 1, wherein step 4 comprises:
step 4.1: filtering and de-duplicating the sub-image detection results;
step 4.2: merging the filtered and de-duplicated sub-image detection results to obtain the detection result of the optical remote sensing image to be detected.
CN202210909740.1A 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method Active CN115272242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210909740.1A CN115272242B (en) 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210909740.1A CN115272242B (en) 2022-07-29 2022-07-29 YOLOv5-based optical remote sensing image target detection method

Publications (2)

Publication Number Publication Date
CN115272242A (en) 2022-11-01
CN115272242B CN115272242B (en) 2024-02-27

Family

ID=83746239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210909740.1A Active CN115272242B (en) 2022-07-29 2022-07-29 YOLOv 5-based optical remote sensing image target detection method

Country Status (1)

Country Link
CN (1) CN115272242B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385903A (en) * 2023-05-29 2023-07-04 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Anti-distortion on-orbit target detection method and model for 1-level remote sensing data

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633661A (en) * 2019-08-31 2019-12-31 南京理工大学 Semantic segmentation fused remote sensing image target detection method
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
WO2020232905A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Superobject information-based remote sensing image target extraction method, device, electronic apparatus, and medium
CN112364719A (en) * 2020-10-23 2021-02-12 西安科锐盛创新科技有限公司 Method for rapidly detecting remote sensing image target
CN112507777A (en) * 2020-10-10 2021-03-16 厦门大学 Optical remote sensing image ship detection and segmentation method based on deep learning
CN112668390A (en) * 2020-11-17 2021-04-16 福建省星云大数据应用服务有限公司 High-efficiency single remote sensing image target detection method and system
CN113569194A (en) * 2021-06-10 2021-10-29 中国人民解放军海军工程大学 Rotating rectangular box representation and regression method for target detection
CN114359565A (en) * 2021-12-14 2022-04-15 阿里巴巴(中国)有限公司 Image detection method, storage medium and computer terminal

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020232905A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Superobject information-based remote sensing image target extraction method, device, electronic apparatus, and medium
CN110633661A (en) * 2019-08-31 2019-12-31 南京理工大学 Semantic segmentation fused remote sensing image target detection method
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN112507777A (en) * 2020-10-10 2021-03-16 厦门大学 Optical remote sensing image ship detection and segmentation method based on deep learning
CN112364719A (en) * 2020-10-23 2021-02-12 西安科锐盛创新科技有限公司 Method for rapidly detecting remote sensing image target
CN112668390A (en) * 2020-11-17 2021-04-16 福建省星云大数据应用服务有限公司 High-efficiency single remote sensing image target detection method and system
CN113569194A (en) * 2021-06-10 2021-10-29 中国人民解放军海军工程大学 Rotating rectangular box representation and regression method for target detection
CN114359565A (en) * 2021-12-14 2022-04-15 阿里巴巴(中国)有限公司 Image detection method, storage medium and computer terminal

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
XIAOQI WANG et al.: "Improved YOLOv5 with BiFPN on PCB Defect Detection", 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER ENGINEERING (ICAICE)
吴萌萌 et al.: "Small Target Detection Network Based on Adaptive Feature Enhancement", Laser & Optoelectronics Progress, pages 1-14
周旗开 et al.: "Ship Classification and Detection Method for Optical Remote Sensing Images Based on Improved YOLOv5s", Laser & Optoelectronics Progress, pages 1-5
王新 et al.: "Traffic Police Gesture Recognition Based on an Improved YOLOv5 Algorithm", Electronic Measurement Technology, pages 0-4
郎磊 et al.: "Lightweight Remote Sensing Image Target Detection Model Based on YOLOX-Tiny", Laser & Optoelectronics Progress, pages 1-18

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385903A (en) * 2023-05-29 2023-07-04 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Anti-distortion on-orbit target detection method and model for 1-level remote sensing data
CN116385903B (en) * 2023-05-29 2023-09-19 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Anti-distortion on-orbit target detection method and model for 1-level remote sensing data

Also Published As

Publication number Publication date
CN115272242B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN110610166B (en) Text region detection model training method and device, electronic equipment and storage medium
CN112380921A (en) Road detection method based on Internet of vehicles
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN113936256A (en) Image target detection method, device, equipment and storage medium
CN111651474B (en) Method and system for converting natural language into structured query language
CN111368636B (en) Object classification method, device, computer equipment and storage medium
CN111260666B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN111931859B (en) Multi-label image recognition method and device
CN111062451A (en) Image description generation method based on text guide graph model
CN111612789A (en) Defect detection method based on improved U-net network
CN112381837A (en) Image processing method and electronic equipment
CN114996511A (en) Training method and device for cross-modal video retrieval model
CN115272242B (en) YOLOv 5-based optical remote sensing image target detection method
CN111898608B (en) Natural scene multi-language character detection method based on boundary prediction
CN116258931B (en) Visual finger representation understanding method and system based on ViT and sliding window attention fusion
CN113780241B (en) Acceleration method and device for detecting remarkable object
CN116188361A (en) Deep learning-based aluminum profile surface defect classification method and device
CN115457385A (en) Building change detection method based on lightweight network
CN115270754A (en) Cross-modal matching method, related device, electronic equipment and storage medium
CN114140806A (en) End-to-end real-time exercise detection method
CN115063831A (en) High-performance pedestrian retrieval and re-identification method and device
CN114743045A (en) Small sample target detection method based on double-branch area suggestion network
CN113223006A (en) Lightweight target semantic segmentation method based on deep learning
Yang et al. Road Damage Detection and Classification Based on Multi-Scale Contextual Features
CN113963150B (en) Pedestrian re-identification method based on multi-scale twin cascade network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant