CN112183414A - Weak supervision remote sensing target detection method based on mixed hole convolution - Google Patents

Weak supervision remote sensing target detection method based on mixed hole convolution

Info

Publication number
CN112183414A
Authority
CN
China
Prior art keywords
feature
detection
representing
features
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011068687.4A
Other languages
Chinese (zh)
Inventor
陈苏婷
邵东威
张闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202011068687.4A priority Critical patent/CN112183414A/en
Publication of CN112183414A publication Critical patent/CN112183414A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a weak supervision remote sensing target detection method based on mixed hole convolution (hybrid dilated convolution). The method adopts several custom designs, including mixed hole convolution, a channel attention mechanism and multilayer pooling, to enhance multi-scale feature extraction and fusion and to improve robustness to objects of different sizes. In addition, asynchronous iterative alternating training between a strongly supervised detector and a weakly supervised detector is employed, so that training and detection require only image-level ground-truth labels and the two detectors cooperatively improve detection performance.

Description

Weak supervision remote sensing target detection method based on mixed hole convolution
Technical Field
The invention relates to the field of pattern recognition, in particular to a weak supervision remote sensing target detection method based on mixed hole convolution.
Background
With the development and combination of aviation technology and computer vision, high-altitude, high-resolution optical remote sensing images have become easier to acquire and are applied in many fields. As a fundamental feature extraction problem in remote sensing image analysis, target detection has a considerable research history in academia. Specifically, remote sensing image target detection comprises locating ground objects and classifying their categories. In recent years, research in this field has advanced rapidly, and many algorithms can simultaneously achieve high-precision positioning and recognition of ground objects. Most methods decompose the task into two stages, feature extraction and target identification, and according to the type of feature extracted, target detection methods for remote sensing images can be divided into methods based on traditional handcrafted features and methods based on deep learning.
Traditional target detection methods for remote sensing images can be roughly divided into three steps: first, a sliding window selects the regions to be detected; then features are extracted from each selected region; finally, a classifier such as a support vector machine judges the object category contained in each region. However, traditional methods face two major problems. On one hand, the sliding window scans the whole image without any pertinence, so the time complexity is high and a large number of redundant windows require feature extraction. On the other hand, because the information contained in remote sensing images is very complex, object types and sizes are diverse, and the edges between objects and backgrounds such as cities or forests are not obvious, traditional handcrafted feature extraction algorithms based on image processing and machine learning cannot capture the semantic information of objects, so the robustness of remote sensing target detection is poor.
Remote sensing target detection methods based on deep learning are essentially end-to-end models: a complete framework that simultaneously covers the recognition of objects in an image and the regression of object detection boxes. A region extractor first generates a number of regions that may contain objects of interest. A feature extractor then extracts features from these regions of interest. Finally, according to the extracted features, a classifier predicts the category of the object in each region of interest, and a position estimator predicts the object position more accurately. Deep learning based methods consider features at the global scale and use fully connected features to refine the position and size of candidate boxes, so they have a certain robustness to object size in an image, but they lack natural robustness to object scale changes. Existing work therefore typically fuses feature maps of multiple scales to address this problem, for example the Feature Pyramid Network (FPN). Such methods compensate for the loss of low-level visual features during the extraction of high-level semantic features and benefit the feature learning of the network. However, they generally predict on the multi-scale feature maps separately, which makes the network very complex and difficult to train.
In addition, another important problem in remote sensing target detection is the lack of annotated data sets. Advances in remote sensing technology have produced a large amount of high-resolution data in which each image contains many target objects to be detected, and manually annotating the detection boxes of these objects one by one requires a great deal of manpower and material resources.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and defects in the prior art, the invention provides a weak supervision remote sensing target detection method based on mixed hole convolution. The invention designs a novel backbone network that greatly reduces information loss during feature extraction, and introduces a channel attention module and a multilayer pooling module to further strengthen and fuse the extracted features. Meanwhile, a weakly supervised learning mode is adopted, so that the target detection task can be trained without box-level supervision information while detection precision is cooperatively improved.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme: a weak supervision remote sensing target detection method based on mixed hole convolution, comprising the following steps:
(1) acquiring a remote sensing image data set to be detected, and dividing the data set into a training set, a verification set and a test set according to a proportion;
(2) constructing a lossless residual network by using mixed hole convolution, and extracting multi-scale features, namely low-level visual features and high-level semantic features, from the target objects in the remote sensing image by using the lossless residual network, so that the receptive field covers the whole area, loss of edge information is avoided, and the robustness of the whole network to multi-scale targets in the remote sensing image is directly improved;
(3) sending the features extracted in the step (2) into a channel attention module, strengthening key feature information effective to a target detection task, and inhibiting invalid feature information;
(4) sending the features enhanced in the step (3) into a cascade multilayer pooling module for feature fusion to realize further fusion of low-level visual features and high-level semantic features, wherein the fused features are used as final output of a feature extraction network;
(5) sending the final features obtained in the step (4) into a cooperative detection module, wherein the module has two branches, a multi-instance learning branch and a detection frame regression branch: a weakly supervised detection network WSDDN serves as the multi-instance learning branch to generate pseudo label information, a strongly supervised detection network Fast R-CNN serves as the detection frame regression branch to realize more accurate target positioning, and the class probability and the detection frame of each target in the image are used as the detection result of the module;
(6) calculating the consistency error of the two branches according to the detection results of the step (5), updating the weight parameters of the two branches simultaneously through a gradient descent algorithm for collaborative training, testing the detection precision on the verification set, and continuously adjusting the network model until the precision meets expectations;
(7) and taking the trained network model as a detector, inputting the characteristics of the test set into the detector for detection, and obtaining a detection result, namely the probability and the detection frame of the target object in the remote sensing image.
Further, in the step (2), a lossless residual network is constructed by using mixed hole convolution and lossless multi-scale feature extraction is performed on the targets in the remote sensing image, wherein the method comprises the following steps:
(2.1) based on ResNet-101, inserting two 3x3 hole convolutions with expansion rates of 2 and 5 respectively after the standard 3x3 convolution in the original residual block to form a continuous hole convolution combination with expansion rates of 1, 2 and 5, thereby constructing a new residual block, namely the lossless residual block. In addition, dense connections are added in the lossless residual block, namely the output of each hole convolution layer is concatenated with the input features and then fed into the next hole convolution layer, so that the bottom-layer features beneficial to target positioning are shared and reused. ResNet-101 here refers to a residual network with a depth of 101 layers.
(2.2) the first three stages of ResNet-101 are retained, and then 23 and 3 lossless residual blocks are stacked in the 4th and 5th stages respectively to replace the 4th and 5th stages of the original network. This stacked structure improves the information utilization rate while keeping the size of the receptive field unchanged, effectively enhances the correlation between long-range information, and relieves the gridding effect;
(2.3) stages 4 and 5 keep the same number of input channels as stage 3, i.e. 256 convolution kernels, and remove the downsampling operation so that the resolution of the output feature map remains at 1/8 of the original image.
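As an illustration of steps (2.1) to (2.3), the following PyTorch sketch shows one possible form of the lossless residual block: three 3x3 convolutions with expansion (dilation) rates 1, 2 and 5, dense connections in which each layer receives the concatenation of the block input and all previous outputs, and a 1x1 projection back to 256 channels before the residual addition. The channel width, normalization layers and the final projection are illustrative assumptions rather than details fixed by the text.

```python
import torch
import torch.nn as nn


class LosslessResidualBlock(nn.Module):
    """Sketch of a 'lossless' residual block: a standard 3x3 convolution followed by
    3x3 dilated (hole) convolutions with rates 2 and 5, with dense connections, i.e.
    each layer receives the concatenation of the block input and all previous outputs."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for rate in (1, 2, 5):  # continuous dilation-rate combination 1, 2, 5
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=rate, dilation=rate, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ))
            in_ch += channels  # dense connection widens the next layer's input
        # assumed 1x1 projection so the block output matches the input width
        self.project = nn.Conv2d(in_ch, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.project(torch.cat(feats, dim=1))  # residual addition


if __name__ == "__main__":
    block = LosslessResidualBlock(256)
    y = block(torch.randn(1, 256, 64, 64))
    print(y.shape)  # torch.Size([1, 256, 64, 64]); no down-sampling, resolution preserved
```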
Further, in step (3), the features extracted in step (2) are used as the input of the channel attention module to enhance the feature expression most relevant to the target location, and the specific process is as follows:
(3.1) for the features F_stage5 ∈ R^(H×W×C) extracted in the 5th stage in step (2.3), the module performs one convolution operation with C+1 convolution kernels to obtain C+1 feature maps f ∈ R^(H×W×(C+1)); H, W and C here represent the height, width and number of channels of the feature map, respectively;
(3.2) decomposing the features obtained in the step (3.1) along the channel dimension to respectively obtain C feature maps f_1 ∈ R^(H×W×C) and 1 single-channel feature map f_2 ∈ R^(H×W×1), and performing a Sigmoid activation operation on f_2 to obtain 1 channel attention matrix M ∈ R^(H×W×1), which automatically reflects the importance, namely the weight value, of each feature channel;
(3.3) multiplying the channel attention matrix M element by element with the feature map f_1, specifically multiplying each pixel point by the corresponding weight in the attention matrix, which further strengthens the features important to the target detection task and suppresses the unimportant features, finally obtaining the output features F_attention ∈ R^(H×W×C). The mathematical expression of the whole module is

F_attention = σ(f_2) ⊗ f_1

where ⊗ represents element-by-element multiplication and σ(*) represents the Sigmoid activation function.
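A minimal sketch of the attention module in step (3), assuming the C+1 feature maps are produced by a single 1x1 convolution (the kernel size is not specified above): the result is split along the channel dimension into f_1 and f_2, a Sigmoid turns f_2 into the attention matrix M, and M re-weights f_1 element by element.

```python
import torch
import torch.nn as nn


class AttentionModule(nn.Module):
    """Sketch of step (3): one convolution produces C+1 maps, split into f1 (C maps)
    and f2 (1 map); sigmoid(f2) forms the attention matrix M, which re-weights f1."""

    def __init__(self, channels: int, kernel_size: int = 1):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(channels, channels + 1, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.conv(x)                   # C+1 feature maps
        f1, f2 = f[:, :-1], f[:, -1:]      # split along the channel dimension
        m = torch.sigmoid(f2)              # attention matrix M, one weight per pixel
        return m * f1                      # F_attention = sigma(f2) multiplied with f1


if __name__ == "__main__":
    att = AttentionModule(256)
    out = att(torch.randn(1, 256, 64, 64))
    print(out.shape)  # torch.Size([1, 256, 64, 64])
```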
Further, in the step (4), the output features of the step (3) are sent to a cascade multilayer pooling module to realize feature fusion of different layers, and the method is as follows:
(4.1) the module applies pooling layers with 6 different kernel sizes (1x1, 2x2, 4x4, 8x8, 10x2, 2x20) to the feature F_attention obtained in step (3.3), performing a multilevel pooling operation to obtain feature maps at 6 different spatial scales P_i = {P_1, P_2, P_3, P_4, P_5, P_6}, wherein the 5th and 6th kernels are average pooling layers in the vertical and horizontal directions respectively; this design can capture the long strip-shaped target features, such as bridges and ships, that are difficult to detect in remote sensing images. The expression of the step is as follows:

P_i = p_pool(F_attention), i ∈ {1, 2, ..., 6}

where P_i indicates the pooled features and p_pool(*) denotes the pooling operation with the i-th kernel, which is either an average pooling p_avg(*) or a maximum pooling p_max(*) operation.
(4.2) compressing the channel number of the feature maps extracted in the step (4.1) to 1/8 of the input feature F_attention by using a 1x1 convolution, for limiting the weight of the global features in the subsequent feature fusion stage, to obtain the intermediate features C_i = {C_1, C_2, C_3, C_4, C_5, C_6}. The expression of the step is as follows: C_i = f_conv(P_i), i ∈ {1, 2, ..., 6}, where C_i represents an intermediate feature, f_conv(*) denotes the convolution operation and i denotes the layer number of the module.
(4.3) splicing the intermediate features C_1 to C_6 obtained in the step (4.2) and the original input feature F_attention for the first time on the channel dimension to obtain the fused feature F_concat, integrating the coarse-grained and fine-grained feature information to make up for the loss of spatial information caused by deepening the network. The expression of the step is as follows:

F_concat = C_1 ⊕ C_2 ⊕ ... ⊕ C_6 ⊕ F_attention

where F_concat and F_attention respectively represent the fused feature and the attention-enhanced feature, and ⊕ represents the concatenation operation of the feature maps in the channel dimension.
(4.4) down-sampling the feature F_stage2 extracted in the 2nd stage in the step (2.3) to the size of the fused feature F_concat from step (4.3), splicing it with F_concat for the second time on the channel dimension, and performing three convolution operations on the spliced features to further promote the fusion of low-level high-resolution detail features and high-level semantic features, obtaining the final output feature F_out of the feature extraction module. The feature extraction module is the collective term for the lossless residual network, the channel attention module and the cascade multilayer pooling module. The expression of the step is as follows:

F_out = f_conv(f_conv(f_conv(f_down(F_stage2) ⊕ F_concat)))

where F_stage2 and F_out respectively represent the features extracted in the 2nd stage of the step (2) and the final output features of the step (4), and f_down(*) denotes the down-sampling operation.
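The cascade multilayer pooling module of step (4) might be sketched as below. Several details are assumptions made only for illustration: the six kernel sizes are treated as adaptive output sizes, the pooled maps are up-sampled back to the input resolution so that they can be concatenated, and maximum pooling is used for the four square kernels while average pooling is used for the two strip kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadeMultiPooling(nn.Module):
    """Sketch of steps (4.1)-(4.3): multi-level pooling at 6 scales (including
    vertical and horizontal strip pooling), 1x1 compression to C/8 channels, and
    channel-wise concatenation with the input feature F_attention."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.bins = [(1, 1), (2, 2), (4, 4), (8, 8), (10, 2), (2, 20)]
        squeezed = channels // 8
        self.compress = nn.ModuleList(
            nn.Conv2d(channels, squeezed, kernel_size=1) for _ in self.bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        branches = []
        for i, (bin_hw, conv) in enumerate(zip(self.bins, self.compress)):
            # assumption: max pooling for the square bins, average pooling for the strips
            if i < 4:
                p = F.adaptive_max_pool2d(x, bin_hw)
            else:
                p = F.adaptive_avg_pool2d(x, bin_hw)
            c = conv(p)                                  # C_i = f_conv(P_i)
            branches.append(F.interpolate(c, size=(h, w), mode="bilinear",
                                          align_corners=False))
        return torch.cat(branches + [x], dim=1)          # F_concat


if __name__ == "__main__":
    module = CascadeMultiPooling(256)
    fused = module(torch.randn(1, 256, 64, 64))
    print(fused.shape)  # 6 * 32 + 256 = 448 channels
```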
Further, constructing a two-stage collaborative detection module with a multi-instance learning branch and a detection frame regression branch to train and detect the remote sensing images in the training set; the specific process is as follows:
(5.1) for each training or test image, using a selective search algorithm (SSW) to generate 2000 candidate frames of the target to be detected, mapping each candidate frame onto the final feature F_out output in the step (4), and normalizing the feature map corresponding to each candidate frame with a spatial pyramid pooling (SPP) layer to obtain pooled features of fixed size;
(5.2) passing the pooled features obtained in the step (5.1) through two fully connected layers to convert them into the feature vectors of all candidate frames, and sending the feature vectors into two different branches: one branch outputs, according to the content of the candidate frame, the probability that the target object belongs to each category; the other branch outputs, according to the position of the candidate frame, the probability that the candidate frame contains each kind of target object. Each branch consists of a fully connected layer and a Softmax layer, and the output matrices of the two branches are multiplied element by element to obtain the category label of each candidate frame, whose calculation formula is:

P_jc = p_jc^cls · p_jc^loc

where P_jc represents the category label of each candidate frame, p_jc^cls represents the class probability that candidate frame j belongs to category c, and p_jc^loc represents the position probability that candidate frame j belongs to category c;
(5.3) adding the category labels of all the candidate frames to obtain the prediction probability of each category of target object, which is used as the image-level prediction label of the whole remote sensing image; the calculation formula of the prediction probability of each category of target object is:

ŷ_c = Σ_{j=1}^{J_W} P_jc

where ŷ_c represents the target category prediction result of the whole image and J_W represents the number of candidate frames. Then, the cross entropy loss between the predicted label and the real label is calculated to iteratively update the training process of the WSDDN, with the calculation formula:

L_WSDDN = -Σ_{c=1}^{C} log( y_c (Σ_{j=1}^{J_W} P_jc - 1/2) + 1/2 )

where y_c represents the real label of the target category; since a two-class cross entropy loss is used, y_c ∈ {-1, 1}; J_W represents the number of candidate frames and C represents the number of categories.
(5.4) when the loss in the step (5.3) exceeds a threshold (for example, the threshold is set to 0.5), the weakly supervised prediction results with high confidence in the WSDDN, namely the pseudo labels, are extracted and provided as real labels for computing the error of the strongly supervised prediction results obtained by Fast R-CNN, thereby realizing more accurate detection frame regression. Specifically, the final feature F_out obtained in the step (4.4) is likewise passed through a spatial pyramid pooling layer and two fully connected layers and then sent into two different branches (namely a classification branch and a regression branch) to obtain the predicted class probability p_ic and coordinate parameters t_ic, respectively, as the strongly supervised prediction results.
(5.5) normalizing the cooperative training process of the two detection networks under strong and weak supervision by using a joint loss function to obtain the final prediction result, wherein the specific process is as follows:
1) obtaining the prediction labels {(p_jc, t_jc)} and {(p_ic, t_ic)} of WSDDN and Fast R-CNN on the same remote sensing image;
2) calculating the class loss L_cls of the WSDDN for each candidate frame;
3) calculating the class loss L_cls_inter and the frame regression loss L_DIoU between WSDDN and Fast R-CNN for each candidate region, wherein the frame regression loss adopts the distance intersection-over-union (DIoU) loss;
4) weighting and summing the three parts of the loss to obtain the joint loss function L_SSD of the cooperative detection network, where J_W and J_S respectively represent the number of candidate frames extracted under weak and strong supervision; p_jc and p_ic respectively represent the weakly and strongly supervised prediction classes, and t_jc and t_ic respectively represent the weakly and strongly supervised predicted position coordinates; L_cls_inter represents the consistency of class prediction between the two detection networks under strong and weak supervision, and L_cls represents the class prediction loss inside the strongly supervised network; I_ij takes the value 1 when the IoU overlap of the target object detection frames extracted by the two networks is greater than 0.5, and 0 otherwise; β is a hyper-parameter between 0 and 1 used to balance the consistency of the predictions of the strongly and weakly supervised networks, and a larger β indicates that the strongly supervised network trusts more the target object positions predicted by the weakly supervised network. The last term in the loss function constrains the consistency of the detection frame positions between the two networks and prevents the detection frames predicted under strong and weak supervision from differing too much. In addition, the frame regression operation in the collaborative loss function adopts DIoU, and the calculation steps are as follows:
DIoU = IoU - ρ²(b, b_gt) / c²

where IoU is the area of the overlap of the two regions divided by the area of their union, c represents the diagonal length of the minimum enclosing region that covers both the anchor frame and the real detection frame, and ρ(b, b_gt) represents the distance between the center points of the anchor frame and the real detection frame. The calculation formula of the frame regression loss function based on the DIoU is as follows:

L_DIoU = 1 - IoU + ρ²(b, b_gt) / c²
in summary, the overall loss function of the cooperative detection module is as follows:
Ltotal=LWSDDN+LSSD
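To make the multi-instance learning branch of steps (5.2) and (5.3) concrete, the sketch below scores a set of candidate-frame feature vectors in the WSDDN style: a classification branch applies Softmax over categories, a detection branch applies Softmax over candidate frames, their element-wise product gives P_jc, summing over frames gives the image-level prediction, and the image-level loss above is evaluated against labels y_c in {-1, 1}. The feature dimension, the number of classes and the clamping added for numerical stability are illustrative assumptions.

```python
import torch
import torch.nn as nn


class WSDDNHead(nn.Module):
    """Sketch of the multi-instance learning branch (WSDDN) of step (5)."""

    def __init__(self, feat_dim: int = 4096, num_classes: int = 13):
        super().__init__()
        self.fc_cls = nn.Linear(feat_dim, num_classes)   # classification branch
        self.fc_det = nn.Linear(feat_dim, num_classes)   # detection (position) branch

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (J_W, feat_dim) pooled feature vectors of the candidate frames
        p_cls = torch.softmax(self.fc_cls(roi_feats), dim=1)  # softmax over categories
        p_det = torch.softmax(self.fc_det(roi_feats), dim=0)  # softmax over candidate frames
        return p_cls * p_det                                  # P_jc, shape (J_W, C)


def wsddn_loss(p_jc: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Image-level loss with labels y_c in {-1, +1}:
    L = -sum_c log( y_c * (sum_j P_jc - 1/2) + 1/2 )."""
    y_hat = p_jc.sum(dim=0).clamp(0.0, 1.0)            # image-level prediction per category
    inner = (y * (y_hat - 0.5) + 0.5).clamp_min(1e-6)  # clamp is an assumption for stability
    return -(inner.log()).sum()


if __name__ == "__main__":
    head = WSDDNHead(feat_dim=4096, num_classes=13)
    scores = head(torch.randn(2000, 4096))              # 2000 candidate frames per image
    labels = torch.tensor([1.0] * 3 + [-1.0] * 10)      # image-level labels for 13 classes
    print(scores.shape, wsddn_loss(scores, labels).item())
```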
has the advantages that: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
aiming at the problem of insufficient marking quantity of remote sensing data, the invention designs an end-to-end remote sensing target detection network combining a weak supervision detector and a strong supervision detector, constructs a combined loss function, and performs collaborative training, parameter sharing and synchronous promotion on the two, thereby remarkably improving the performance of training only by using an image-level label;
aiming at the characteristic of huge target scale difference in the remote sensing image, the invention designs a novel backbone network by utilizing mixed hole convolution, thereby greatly reducing information loss in the characteristic extraction process and realizing the full coverage of receptive field; an attention module and a cascade multi-layer pooling module are connected to the rear end of the system, so that the sensitivity of the network to scale change is effectively inhibited, and the capability of feature learning is further improved.
Aiming at the defect of the detection branch of Fast R-CNN in the frame regression stage, the invention defines a multitask loss function based on DIoU, and can improve the accuracy and convergence speed of frame regression.
Drawings
FIG. 1 is a training flow diagram of the present invention;
FIG. 2 is a block diagram of a network used in the present invention;
FIG. 3 is a schematic diagram of the test results obtained from the training of the present invention;
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention relates to a cooperative learning-based weak supervision remote sensing image multi-target detection method; the algorithm framework is shown in FIG. 1, and the method comprises the following steps:
(1) acquiring a remote sensing image data set to be detected, and dividing the data set into a training set, a verification set and a test set according to a proportion;
the remote sensing image data used in this embodiment are TGRS-HRRSD and DIOR data sets. Wherein, the TGRS-HRRSD comprises 21761 high-altitude images from google earth and hundredth maps in total, comprising 55740 target object instances of 13 classes; the DIOR comprises 20 types of 23463 specially-picked high-altitude remote sensing images, and the data set comprises 192472 target instances.
In this embodiment, the PyTorch framework is adopted and the programming experiments are carried out in the Python language; PyTorch can be regarded as a powerful deep neural network framework with automatic differentiation. The data set is divided into a training set, a verification set and a test set, which are respectively used for training, verifying and testing the detection model, and the basic information is shown in Table 1:
TABLE 1
Data set Training set Verification set Test set
TGRS-HRRSD 5401 5417 10943
DIOR 5862 5863 11738
(2) Constructing a lossless residual network by using mixed hole convolution, and performing lossless multi-scale feature extraction on the targets in the remote sensing image, wherein the method comprises the following steps:
(2.1) based on ResNet-101, inserting two 3x3 hole convolutions with expansion rates of 2 and 5 respectively after the standard 3x3 convolution in the original residual block to form a continuous hole convolution combination with expansion rates of 1, 2 and 5, thereby constructing a new residual block, namely the lossless residual block. In addition, dense connections are added in the lossless residual block, namely the output of each hole convolution layer is concatenated with the input features and then fed into the next hole convolution layer, so that the bottom-layer features that strongly influence target positioning are shared and reused. ResNet-101 here refers to a residual network with a depth of 101 layers.
(2.2) the first three stages of ResNet-101 are retained, and then 23 and 3 lossless residual blocks are stacked in the 4th and 5th stages respectively to replace the 4th and 5th stages of the original network. This stacked structure improves the information utilization rate while keeping the size of the receptive field unchanged, effectively enhances the correlation between long-range information, and relieves the gridding effect.
(2.3) stages 4 and 5 keep the same number of input channels as stage 3, i.e. 256 convolution kernels, and remove the downsampling operation so that the resolution of the output feature map remains at 1/8 of the original image.
(3) Taking the features extracted in the step (2) as the input of the channel attention module to strengthen the feature expression most relevant to target positioning, wherein the method comprises the following steps:
(3.1) for the features F_stage5 ∈ R^(H×W×C) extracted in the step (2), the module performs one convolution operation with C+1 convolution kernels to obtain C+1 feature maps f ∈ R^(H×W×(C+1));
(3.2) decomposing the features obtained in the step (3.1) along the channel dimension to respectively obtain C feature maps f_1 ∈ R^(H×W×C) and 1 single-channel feature map f_2 ∈ R^(H×W×1), and performing a Sigmoid activation operation on f_2 to obtain 1 channel attention matrix M ∈ R^(H×W×1), which automatically reflects the importance, namely the weight value, of each feature channel;
(3.3) multiplying the channel attention matrix M element by element with the feature map f_1, using the obtained importance to promote the features important to the target detection task and suppress the unimportant ones, and finally obtaining the output features F_attention ∈ R^(H×W×C).
Further, the overall expression of step (3) is as follows:

F_attention = σ(f_2) ⊗ f_1

where ⊗ represents element-by-element multiplication and σ(*) represents the Sigmoid activation function.
(4) Sending the features enhanced in the step (3) into a cascade multi-layer pooling module to realize feature fusion of different layers, wherein the method comprises the following steps:
(4.1) the module applies pooling layers with 6 different kernel sizes (1x1, 2x2, 4x4, 8x8, 10x2, 2x20) to the feature F_attention obtained in step (3.3), performing a multilevel pooling operation to obtain feature maps at 6 different spatial scales P_i = {P_1, P_2, P_3, P_4, P_5, P_6}. The 5th and 6th kernels are average pooling layers in the vertical and horizontal directions respectively, a design that can capture the strip-shaped target features, such as bridges and ships, that are difficult to detect in remote sensing images.
(4.2) compressing the channel number of the feature maps extracted in the step (4.1) to 1/8 of the original input channels by using a 1x1 convolution, for limiting the weight of the global features in the subsequent feature fusion stage, to obtain the intermediate features C_i = {C_1, C_2, C_3, C_4, C_5, C_6}.
(4.3) splicing the intermediate features C_1 to C_6 obtained in the step (4.2) and the original input feature F_attention for the first time on the channel dimension to obtain the fused feature F_concat, fusing the low-level high-resolution detail features with the high-level semantic features of the network to make up for the loss of spatial information caused by deepening the network.
(4.4) down-sampling the feature F_stage2 extracted in the 2nd stage in the step (2.3) to the size of the fused feature F_concat from step (4.3), splicing it with F_concat for the second time on the channel dimension, and performing three convolution operations on the spliced features to further promote the fusion of low-level high-resolution detail features and high-level semantic features, obtaining the final output feature F_out of the feature extraction module. The feature extraction module is the collective term for the lossless residual network, the channel attention module and the cascade multilayer pooling module.
Further, the overall expression of step (4) is as follows:

P_i = p_pool(F_attention), i ∈ {1, 2, ..., 6}
C_i = f_conv(P_i), i ∈ {1, 2, ..., 6}
F_concat = C_1 ⊕ C_2 ⊕ ... ⊕ C_6 ⊕ F_attention
F_out = f_conv(f_conv(f_conv(f_down(F_stage2) ⊕ F_concat)))

where P_i, C_i, F_concat, F_stage2, F_attention and F_out respectively represent the pooled features, the intermediate features, the fused feature, the features extracted in the 2nd stage of the step (2), the attention-enhanced features and the final output features of the step (4); p_pool(*) denotes the pooling operation with the i-th kernel, which is either an average pooling p_avg(*) or a maximum pooling p_max(*) operation; f_conv(*) represents the convolution operation; f_down(*) denotes the down-sampling operation; ⊕ denotes the concatenation operation of the feature maps in the channel dimension; and i denotes the layer number of the module.
(5) Constructing a two-stage collaborative detection module with a multi-instance learning branch and a detection frame regression branch to train and detect the remote sensing images in the training set, wherein the method comprises the following steps:
(5.1) for each training or test image, using a selective search algorithm (SSW) to generate 2000 candidate frames of the target to be detected, mapping each candidate frame onto the final feature F_out output in the step (4), and then normalizing the feature map corresponding to each candidate frame by using a spatial pyramid pooling (SPP) layer to obtain the pooled features of fixed size.
(5.2) passing the pooled features obtained in the step (5.1) through two fully connected layers to convert them into the feature vectors of all candidate frames, and sending the feature vectors into two different branches: one branch outputs, according to the content of the candidate frame, the probability that the target object belongs to each category; the other branch outputs, according to the position of the candidate frame, the probability that the candidate frame contains each kind of target object. Each branch consists of a fully connected layer and a Softmax layer, and the output matrices of the two branches are multiplied element by element to obtain the category label of each candidate frame.
(5.3) adding the category labels of all the candidate frames to obtain the prediction probability of each category of target object, which is used as the image-level prediction label of the whole remote sensing image, and the cross entropy between this image-level prediction and the real label is used as the loss function of the WSDDN.
(5.4) when the loss in the step (5.3) exceeds a threshold (for example, the threshold is set to 0.5), extracting the weakly supervised prediction results with high confidence in the WSDDN, i.e. the pseudo labels, and providing them as real labels to Fast R-CNN for more accurate detection frame regression. Specifically, the final feature F_out obtained in the step (4.4) is likewise passed through a spatial pyramid pooling layer and two fully connected layers and then sent into two different branches (namely a classification branch and a regression branch) to obtain the predicted class probability p_ic and coordinate parameters t_ic, respectively, as the strongly supervised prediction results.
(5.5) normalizing the cooperative training process of the two detection networks under strong and weak supervision by using a joint loss function to obtain the final prediction result.
Further, the calculation formula of the category label of each candidate frame in step (5.2) is:

P_jc = p_jc^cls · p_jc^loc

where P_jc represents the category label of each candidate frame, p_jc^cls represents the class probability that candidate frame j belongs to category c, and p_jc^loc represents the position probability that candidate frame j belongs to category c.
Further, the calculation formula of the prediction probability of each category of target object in the step (5.3) is as follows:

ŷ_c = Σ_{j=1}^{J_W} P_jc

where ŷ_c represents the target category prediction result of the whole image and J_W represents the number of candidate frames.
Further, the loss function of the WSDDN is defined as:

L_WSDDN = -Σ_{c=1}^{C} log( y_c (Σ_{j=1}^{J_W} P_jc - 1/2) + 1/2 )

where y_c represents the real label of the target category; since a two-class cross entropy loss is used, y_c ∈ {-1, 1}; J_W represents the number of candidate frames and C represents the number of categories.
Further, in step (5.5), a joint loss function is used to normalize the cooperative training process of the two detection networks under strong and weak supervision, which is specifically as follows:
1) obtaining the prediction labels {(p_jc, t_jc)} and {(p_ic, t_ic)} of WSDDN and Fast R-CNN on the same remote sensing image;
2) calculating the class loss L_cls of the WSDDN for each candidate frame;
3) calculating the class loss L_cls_inter and the frame regression loss L_DIoU between WSDDN and Fast R-CNN for each candidate region, wherein the frame regression loss adopts the distance intersection-over-union (DIoU) loss;
4) weighting and summing the three parts of the loss to obtain the joint loss function L_SSD of the cooperative detection network, where J_W and J_S respectively represent the number of candidate frames extracted under weak and strong supervision; p_jc and p_ic respectively represent the weakly and strongly supervised prediction classes, and t_jc and t_ic respectively represent the weakly and strongly supervised predicted position coordinates; L_cls_inter represents the consistency of class prediction between the two detection networks under strong and weak supervision, and L_cls represents the class prediction loss inside the strongly supervised network; I_ij takes the value 1 when the IoU overlap of the target object detection frames extracted by the two networks is greater than 0.5, and 0 otherwise; β is a hyper-parameter between 0 and 1 used to balance the consistency of the predictions of the strongly and weakly supervised networks, and a larger β indicates that the strongly supervised network trusts more the target object positions predicted by the weakly supervised network. The last term in the loss function constrains the consistency of the detection frame positions between the two networks and prevents the detection frames predicted under strong and weak supervision from differing too much.
Further, the frame regression operation in the collaborative loss function adopts DIoU, and the calculation steps are as follows:

DIoU = IoU - ρ²(b, b_gt) / c²

where IoU is the area of the overlap of the two regions divided by the area of their union, c represents the diagonal length of the minimum enclosing region that covers both the anchor frame and the real detection frame, and ρ(b, b_gt) represents the distance between the center points of the anchor frame and the real detection frame.
Further, the above-mentioned frame regression loss function is calculated as:

L_DIoU = 1 - IoU + ρ²(b, b_gt) / c²
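Assuming the usual (x1, y1, x2, y2) box representation, the DIoU penalty and the frame regression loss above can be sketched as follows; the coordinate convention and the epsilon guard are assumptions, not details taken from the text.

```python
import torch


def diou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of L_DIoU = 1 - IoU + rho^2(b, b_gt) / c^2 for boxes in (x1, y1, x2, y2) form."""
    # intersection-over-union
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between the two box centers
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # c^2: squared diagonal of the smallest enclosing box
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    return (1.0 - iou + rho2 / c2).mean()


if __name__ == "__main__":
    p = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
    t = torch.tensor([[12.0, 14.0, 48.0, 56.0]])
    print(diou_loss(p, t).item())
```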
further, the overall loss function of the cooperative detection module is as follows:
Ltotal=LWSDDN+LSSD
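A schematic of how the two losses are combined and back-propagated through both branches at once, as described in step (6), is given below. The two linear heads and the stand-in loss terms are placeholders for the real WSDDN and Fast R-CNN branches and for L_WSDDN and L_SSD; only the overall pattern of summing the losses and updating both sets of parameters with gradient descent is intended.

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the two detection branches; the real networks are the WSDDN
# branch and the Fast R-CNN branch built on the shared feature extraction module.
weak_head = nn.Linear(4096, 13)     # placeholder weakly supervised head
strong_head = nn.Linear(4096, 13)   # placeholder strongly supervised head
optimizer = torch.optim.SGD(
    list(weak_head.parameters()) + list(strong_head.parameters()),
    lr=1e-3, momentum=0.9)


def l_wsddn(weak_scores, image_labels):    # stands in for the WSDDN image-level loss
    return nn.functional.binary_cross_entropy_with_logits(
        weak_scores.sum(0, keepdim=True), image_labels)


def l_ssd(weak_scores, strong_scores):     # stands in for the cooperative loss L_SSD
    return nn.functional.mse_loss(strong_scores, weak_scores.detach())


for step in range(3):                          # toy loop over "images"
    roi_feats = torch.randn(100, 4096)         # pooled candidate-frame features
    image_labels = torch.rand(1, 13).round()   # image-level ground truth
    weak_scores = weak_head(roi_feats)
    strong_scores = strong_head(roi_feats)
    loss = l_wsddn(weak_scores, image_labels) + l_ssd(weak_scores, strong_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # both branches updated by gradient descent
    print(f"step {step}: L_total = {loss.item():.4f}")
```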
in this embodiment, two data sets of TGRS-HRRSD and DIOR are tested, and some test results are shown in fig. 3. From experimental results, the detection precision of the method is obviously superior to that of other current weak supervision detection models, and a more comprehensive and compact bounding box prediction result can be generated. Meanwhile, compared with a partial strong supervision detection model, the method has great competitiveness on detection of certain classes.
The foregoing is a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (5)

1. A weak supervision remote sensing target detection method based on mixed hole convolution is characterized by comprising the following steps:
(1) acquiring a remote sensing image data set to be detected, and dividing the data set into a training set, a verification set and a test set according to a proportion;
(2) constructing a lossless residual network by using mixed hole convolution, and extracting multi-scale features, namely low-level visual features and high-level semantic features, of the target objects in the remote sensing image by using the lossless residual network;
(3) sending the features extracted in the step (2) into a channel attention module, strengthening key feature information effective to a target detection task, and inhibiting invalid feature information;
(4) sending the features enhanced in the step (3) into a cascade multilayer pooling module for feature fusion to realize further fusion of low-level visual features and high-level semantic features, wherein the fused features are used as final output of a feature extraction network;
(5) sending the final features obtained in the step (4) into a cooperative detection module, wherein the module has two branches, a multi-instance learning branch and a detection frame regression branch: a weakly supervised detection network WSDDN serves as the multi-instance learning branch to generate pseudo label information, a strongly supervised detection network Fast R-CNN serves as the detection frame regression branch to realize more accurate target positioning, and the class probability and the detection frame of each target in the image are used as the detection result of the module;
(6) calculating the consistency error of the two branches according to the detection results of the step (5), updating the weight parameters of the two branches simultaneously through a gradient descent algorithm for collaborative training, testing the detection precision on the verification set, and continuously adjusting the network model until the precision meets expectations;
(7) and taking the trained network model as a detector, inputting the characteristics of the test set into the detector for detection, and obtaining a detection result, namely the probability and the detection frame of the target object in the remote sensing image.
2. The method for detecting the weakly supervised remote sensing target based on the mixed hole convolution as recited in claim 1, wherein in step (2), a lossless residual network is constructed by using the mixed hole convolution, and lossless multi-scale feature extraction is performed on the target in the remote sensing image, and the method comprises the following steps:
(2.1) inserting two 3x3 hole convolutions with expansion rates of 2 and 5 respectively after the standard 3x3 convolution in the original residual block, with ResNet-101 as the basic model, to form a continuous hole convolution combination with expansion rates of 1, 2 and 5, thereby constructing a new residual block, namely the lossless residual block; dense connections are added in the lossless residual block, namely the output of each hole convolution layer is concatenated with the input features and then fed into the next hole convolution layer, so as to share and reuse the bottom-layer features beneficial to target positioning;
(2.2) reserving the first three stages of ResNet-101, and then stacking 23 and 3 lossless residual blocks in the 4th and 5th stages respectively to replace the 4th and 5th stages in the original network;
(2.3) stages 4 and 5 keep the same number of input channels as stage 3, i.e. 256 convolution kernels, and remove the downsampling operation so that the resolution of the output feature map remains at 1/8 of the original image.
3. The method for detecting the weakly supervised remote sensing target based on the mixed hole convolution as recited in claim 2, wherein in the step (3), the features extracted in the step (2) are sent to a channel attention module to strengthen key feature information effective to a target detection task and inhibit invalid feature information, and the specific method is as follows:
(3.1) for the features F_stage5 ∈ R^(H×W×C) extracted in the 5th stage in step (2.3), the module performs one convolution operation with C+1 convolution kernels to obtain C+1 feature maps f ∈ R^(H×W×(C+1)); H, W and C here represent the height, width and number of channels of the feature map, respectively;
(3.2) decomposing the features obtained in the step (3.1) along the channel dimension to respectively obtain C feature maps f_1 ∈ R^(H×W×C) and 1 single-channel feature map f_2 ∈ R^(H×W×1), and performing a Sigmoid activation operation on f_2 to obtain 1 channel attention matrix M ∈ R^(H×W×1), which automatically reflects the importance, namely the weight value, of each feature channel;
(3.3) multiplying the channel attention matrix M element by element with the feature map f_1, namely multiplying each pixel point by the corresponding weight in the attention matrix, to finally obtain the output features F_attention ∈ R^(H×W×C); the mathematical expression of the whole module is

F_attention = σ(f_2) ⊗ f_1

where ⊗ represents element-by-element multiplication and σ(*) represents the Sigmoid activation function.
4. The method for detecting the weakly supervised remote sensing target based on the mixed hole convolution is characterized in that in the step (4), the output features of the step (3) are sent to a cascade multilayer pooling module to realize feature fusion of different layers, and the method comprises the following steps:
(4.1) the module applies pooling layers with 6 different kernel sizes 1x1, 2x2, 4x4, 8x8, 10x2 and 2x20 to the feature F_attention obtained in step (3.3), performing a multilevel pooling operation to obtain feature maps at 6 different spatial scales P_i = {P_1, P_2, P_3, P_4, P_5, P_6}, wherein the 5th and 6th kernels are average pooling layers in the vertical and horizontal directions respectively, and the expression of this step is:

P_i = p_pool(F_attention), i ∈ {1, 2, ..., 6}

where P_i indicates the pooled features and p_pool(*) denotes the pooling operation with the i-th kernel, which is either an average pooling p_avg(*) or a maximum pooling p_max(*) operation;
(4.2) compressing the channel number of the feature maps extracted in the step (4.1) to 1/8 of the input feature F_attention by using a 1x1 convolution, for limiting the weight of the global features in the subsequent feature fusion stage, to obtain the intermediate features C_i = {C_1, C_2, C_3, C_4, C_5, C_6}; the expression of this step is: C_i = f_conv(P_i), i ∈ {1, 2, ..., 6}, where C_i represents an intermediate feature, f_conv(*) represents the convolution operation, and i represents the layer number of the module;
(4.3) splicing the intermediate features C_1 to C_6 obtained in the step (4.2) and the original input feature F_attention for the first time on the channel dimension to obtain the fused feature F_concat; the expression of this step is:

F_concat = C_1 ⊕ C_2 ⊕ ... ⊕ C_6 ⊕ F_attention

where F_concat and F_attention respectively represent the fused feature and the attention-enhanced feature, and ⊕ represents the concatenation operation of the feature maps in the channel dimension;
(4.4) down-sampling the feature F_stage2 extracted in the 2nd stage in the step (2.3) to the size of the fused feature F_concat from step (4.3), splicing it with F_concat for the second time on the channel dimension, and performing three convolution operations on the spliced features to obtain the final output feature F_out of the feature extraction module; the expression of this step is:

F_out = f_conv(f_conv(f_conv(f_down(F_stage2) ⊕ F_concat)))

where F_stage2 and F_out respectively represent the features extracted in the 2nd stage of the step (2) and the final output features of the step (4), and f_down(*) denotes the down-sampling operation.
5. The method for detecting the weakly supervised remote sensing target based on the mixed hole convolution is characterized in that in step (5), a two-stage collaborative detection module with a multi-instance learning branch and a detection frame regression branch is constructed to train and detect the remote sensing images in the training set; the specific process is as follows:
(5.1) for each training or test image, using a selective search algorithm (SSW) to generate 2000 candidate frames of the target to be detected, mapping each candidate frame onto the final feature F_out output in the step (4), and then normalizing the feature map corresponding to each candidate frame by using a spatial pyramid pooling layer to obtain pooled features of fixed size;
(5.2) passing the pooled features obtained in the step (5.1) through two fully connected layers to convert them into the feature vectors of all candidate frames, and sending the feature vectors into two different branches: one branch outputs, according to the content of the candidate frame, the probability that the target object belongs to each category; the other branch outputs, according to the position of the candidate frame, the probability that the candidate frame contains each kind of target object; each branch consists of a fully connected layer and a Softmax layer, and the output matrices of the two branches are multiplied element by element to obtain the category label of each candidate frame, whose calculation formula is:

P_jc = p_jc^cls · p_jc^loc

where P_jc represents the category label of each candidate frame, p_jc^cls represents the class probability that candidate frame j belongs to category c, and p_jc^loc represents the position probability that candidate frame j belongs to category c;
(5.3) adding the category labels of all the candidate frames to obtain the prediction probability of each category of target object, which is used as the image-level prediction label of the whole remote sensing image; the calculation formula of the prediction probability of each category of target object is:

ŷ_c = Σ_{j=1}^{J_W} P_jc

where ŷ_c represents the target category prediction result of the whole image and J_W represents the number of candidate frames; then, the cross entropy loss between the predicted label and the real label is calculated to iteratively update the training process of the WSDDN, with the calculation formula:

L_WSDDN = -Σ_{c=1}^{C} log( y_c (Σ_{j=1}^{J_W} P_jc - 1/2) + 1/2 )

where y_c represents the real label of the target category; since a two-class cross entropy loss is used, y_c ∈ {-1, 1}; J_W represents the number of candidate frames and C represents the number of categories;
(5.4) when the loss in the step (5.3) exceeds a threshold, extracting the weakly supervised prediction results with high confidence in the WSDDN, namely the pseudo labels, as real labels for computing the error of the strongly supervised prediction results obtained by Fast R-CNN; specifically, the final feature F_out obtained in the step (4.4) is likewise passed through a spatial pyramid pooling layer and two fully connected layers and then sent into two different branches, namely a classification branch and a regression branch, to obtain the predicted class probability p_ic and coordinate parameters t_ic, respectively, as the strongly supervised prediction results;
(5.5) normalizing the cooperative training process of the two detection networks under strong and weak supervision by using a joint loss function to obtain the final prediction result, wherein the specific process is as follows:
1) obtaining the prediction labels {(p_jc, t_jc)} and {(p_ic, t_ic)} of WSDDN and Fast R-CNN on the same remote sensing image;
2) calculating the class loss L_cls of the WSDDN for each candidate frame;
3) calculating the class loss L_cls_inter and the frame regression loss L_DIoU between WSDDN and Fast R-CNN for each candidate region, wherein the frame regression loss adopts the distance intersection-over-union DIoU loss;
4) weighting and summing the three parts of the loss to obtain the joint loss function L_SSD of the cooperative detection network, where J_W and J_S respectively represent the number of candidate frames extracted under weak and strong supervision; p_jc and p_ic respectively represent the weakly and strongly supervised prediction classes, and t_jc and t_ic respectively represent the weakly and strongly supervised predicted position coordinates; L_cls_inter represents the consistency of class prediction between the two detection networks under strong and weak supervision, and L_cls represents the class prediction loss inside the strongly supervised network; I_ij takes the value 1 when the IoU overlap of the target object detection frames extracted by the two networks is greater than 0.5, and 0 otherwise; β is a hyper-parameter between 0 and 1 used to balance the consistency of the predictions of the strongly and weakly supervised networks; the last term in the loss function constrains the consistency of the detection frame positions between the two networks; in addition, the frame regression operation in the collaborative loss function adopts DIoU, calculated as:

DIoU = IoU - ρ²(b, b_gt) / c²

where IoU is the area of the overlap of the two regions divided by the area of their union, c represents the diagonal length of the minimum enclosing region that covers both the anchor frame and the real detection frame, and ρ(b, b_gt) represents the distance between the center points of the anchor frame and the real detection frame; the calculation formula of the frame regression loss function based on the DIoU is:

L_DIoU = 1 - IoU + ρ²(b, b_gt) / c²

In summary, the overall loss function of the cooperative detection module is as follows:

L_total = L_WSDDN + L_SSD
CN202011068687.4A 2020-09-29 2020-09-29 Weak supervision remote sensing target detection method based on mixed hole convolution Pending CN112183414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011068687.4A CN112183414A (en) 2020-09-29 2020-09-29 Weak supervision remote sensing target detection method based on mixed hole convolution

Publications (1)

Publication Number Publication Date
CN112183414A true CN112183414A (en) 2021-01-05

Family

ID=73948296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011068687.4A Pending CN112183414A (en) 2020-09-29 2020-09-29 Weak supervision remote sensing target detection method based on mixed hole convolution

Country Status (1)

Country Link
CN (1) CN112183414A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN111104898A (en) * 2019-12-18 2020-05-05 武汉大学 Image scene classification method and device based on target semantics and attention mechanism
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN111444939A (en) * 2020-02-19 2020-07-24 山东大学 Small-scale equipment component detection method based on weak supervision cooperative learning in open scene of power field
CN111353531A (en) * 2020-02-25 2020-06-30 西安电子科技大学 Hyperspectral image classification method based on singular value decomposition and spatial spectral domain attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUTING CHEN et al.: "FCC-Net: A Full-Coverage Collaborative Network for Weakly Supervised Remote Sensing Object Detection" *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733744B (en) * 2021-01-14 2022-05-24 北京航空航天大学 Camouflage object detection model based on edge cooperative supervision and multi-level constraint
CN112733744A (en) * 2021-01-14 2021-04-30 北京航空航天大学 Camouflage object detection model based on edge cooperative supervision and multi-level constraint
CN112966684A (en) * 2021-03-15 2021-06-15 北湾科技(武汉)有限公司 Cooperative learning character recognition method under attention mechanism
CN113159057A (en) * 2021-04-01 2021-07-23 湖北工业大学 Image semantic segmentation method and computer equipment
CN112926692A (en) * 2021-04-09 2021-06-08 四川翼飞视科技有限公司 Target detection device and method based on non-uniform mixed convolution and storage medium
CN112926692B (en) * 2021-04-09 2023-05-09 四川翼飞视科技有限公司 Target detection device, method and storage medium based on non-uniform mixed convolution
CN113095235A (en) * 2021-04-15 2021-07-09 国家电网有限公司 Image target detection method, system and device based on weak supervision discrimination mechanism
CN113095235B (en) * 2021-04-15 2023-10-27 国家电网有限公司 Image target detection method, system and device based on weak supervision and discrimination mechanism
CN113255759A (en) * 2021-05-20 2021-08-13 广州广电运通金融电子股份有限公司 Attention mechanism-based in-target feature detection system, method and storage medium
CN113255759B (en) * 2021-05-20 2023-08-22 广州广电运通金融电子股份有限公司 In-target feature detection system, method and storage medium based on attention mechanism
WO2022241803A1 (en) * 2021-05-20 2022-11-24 广州广电运通金融电子股份有限公司 Attention mechanism-based system and method for detecting feature in target, and storage medium
CN113326845A (en) * 2021-06-30 2021-08-31 上海云从汇临人工智能科技有限公司 Target detection method, system and storage medium based on self-attention mechanism
CN113569750B (en) * 2021-07-29 2023-07-07 上海动亦科技有限公司 Road target detection and identification method based on spatial feature aggregation
CN113569750A (en) * 2021-07-29 2021-10-29 上海动亦科技有限公司 Road target detection and identification method based on spatial feature aggregation
CN113723254A (en) * 2021-08-23 2021-11-30 三明学院 Method, device, equipment and storage medium for identifying moso bamboo forest distribution
CN113920313A (en) * 2021-09-29 2022-01-11 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113963322A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Detection model training method and device and electronic equipment
CN113963322B (en) * 2021-10-29 2023-08-25 北京百度网讯科技有限公司 Detection model training method and device and electronic equipment
CN114511452B (en) * 2021-12-06 2024-03-19 中南大学 Remote sensing image retrieval method integrating multi-scale cavity convolution and triplet attention
CN114511452A (en) * 2021-12-06 2022-05-17 中南大学 Remote sensing image retrieval method integrating multi-scale cavity convolution and triple attention
CN114359739A (en) * 2022-03-18 2022-04-15 深圳市海清视讯科技有限公司 Target identification method and device
CN114359739B (en) * 2022-03-18 2022-06-28 深圳市海清视讯科技有限公司 Target identification method and device
CN115035409A (en) * 2022-06-20 2022-09-09 北京航空航天大学 Weak supervision remote sensing image target detection algorithm based on similarity comparison learning
CN115035409B (en) * 2022-06-20 2024-05-28 北京航空航天大学 Weak supervision remote sensing image target detection algorithm based on similarity comparison learning
CN116206201A (en) * 2023-02-21 2023-06-02 北京理工大学 Monitoring target detection and identification method, device, equipment and storage medium
CN116012719B (en) * 2023-03-27 2023-06-09 中国电子科技集团公司第五十四研究所 Weak supervision rotating target detection method based on multi-instance learning
CN116012719A (en) * 2023-03-27 2023-04-25 中国电子科技集团公司第五十四研究所 Weak supervision rotating target detection method based on multi-instance learning
CN116895030A (en) * 2023-09-11 2023-10-17 西华大学 Insulator detection method based on target detection algorithm and attention mechanism
CN116895030B (en) * 2023-09-11 2023-11-17 西华大学 Insulator detection method based on target detection algorithm and attention mechanism

Similar Documents

Publication Publication Date Title
CN112183414A (en) Weak supervision remote sensing target detection method based on mixed hole convolution
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN110929607B (en) Remote sensing identification method and system for urban building construction progress
Deng et al. Vision based pixel-level bridge structural damage detection using a link ASPP network
CN112541904B (en) Unsupervised remote sensing image change detection method, storage medium and computing device
CN113033520B (en) Tree nematode disease wood identification method and system based on deep learning
CN111563473A (en) Remote sensing ship identification method based on dense feature fusion and pixel level attention
CN111079739B (en) Multi-scale attention feature detection method
CN112560675B (en) Bird visual target detection method combining YOLO and rotation-fusion strategy
Tian et al. Multiscale building extraction with refined attention pyramid networks
Huang et al. A lightweight network for building extraction from remote sensing images
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
Narazaki et al. Automated vision-based bridge component extraction using multiscale convolutional neural networks
CN113011398A (en) Target change detection method and device for multi-temporal remote sensing image
CN111815576B (en) Method, device, equipment and storage medium for detecting corrosion condition of metal part
CN111368634B (en) Human head detection method, system and storage medium based on neural network
Wu et al. TAL: Topography-aware multi-resolution fusion learning for enhanced building footprint extraction
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
Jiang et al. Arbitrary-shaped building boundary-aware detection with pixel aggregation network
CN115719475A (en) Three-stage trackside equipment fault automatic detection method based on deep learning
He et al. Building extraction based on U-net and conditional random fields
CN113887455B (en) Face mask detection system and method based on improved FCOS
CN113657196B (en) SAR image target detection method, SAR image target detection device, electronic equipment and storage medium
CN114882490A (en) Unlimited scene license plate detection and classification method based on point-guided positioning
Yuan et al. Graph neural network based multi-feature fusion for building change detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105