CN115115938A - Method for detecting salient target of remote sensing image - Google Patents

Method for detecting salient target of remote sensing image

Info

Publication number
CN115115938A
CN115115938A
Authority
CN
China
Prior art keywords
remote sensing
sensing image
features
salient
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210879580.0A
Other languages
Chinese (zh)
Inventor
夏鲁瑞
蔺崎辉
李森
陈雪旗
卢妍
张占月
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University filed Critical Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority to CN202210879580.0A
Publication of CN115115938A
Legal status: Pending (Current)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting salient targets in remote sensing images, which comprises the following steps: S1, acquiring remote sensing image data comprising a training set and a test set, and constructing a remote sensing image salient target detection model comprising a detection feature encoder and a cascade feature decoder; S2, introducing an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, training the remote sensing image salient target detection model on the training-set remote sensing image data, and stopping training when a preset loss function converges, thereby obtaining the trained remote sensing image salient target detection model; and S3, performing salient target prediction on the test-set remote sensing image data with the trained remote sensing image salient target detection model, and outputting the corresponding saliency maps. Because the method decodes features through a cascade structure, missed and false detections of small targets in remote sensing images are alleviated, the prediction confidence of salient regions is improved, and more accurate salient target boundaries can be predicted.

Description

Method for detecting salient target of remote sensing image
Technical Field
The invention mainly relates to the technical field of remote sensing image application, in particular to a method for detecting a salient target of a remote sensing image.
Background
With the explosive growth of remote sensing image data, the traditional utilization of remote sensing images through manual visual interpretation can no longer meet practical needs, so intelligent interpretation methods for remote sensing images are urgently required. As an important preprocessing step in computer vision, salient object detection has achieved good results in natural scenes. However, because remote sensing scenes involve varied shooting angles, diverse ground feature types and complex backgrounds, few salient target detection methods exist for remote sensing images. Moreover, in detecting salient targets in remote sensing images, existing methods perform poorly on the edge regions of salient targets and are prone to false and missed detections of small targets, leaving a considerable gap before practical application.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting salient targets in remote sensing images. The method adopts an encoder-decoder structure, introduces an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, and designs a new loss function to train the target detection model. Detecting salient targets in remote sensing images with the trained model effectively improves the detection of salient target edges and reduces missed and false detections of small targets.
The invention discloses a method for detecting a salient target of a remote sensing image, which comprises the following steps:
S1, acquiring remote sensing image data comprising a training set and a test set, and constructing a remote sensing image salient target detection model comprising a detection feature encoder and a cascade feature decoder;
S2, introducing an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, training the remote sensing image salient target detection model on the training-set remote sensing image data, and stopping training when the preset loss function converges, thereby obtaining the trained remote sensing image salient target detection model;
and S3, performing salient target prediction on the test-set remote sensing image data with the trained remote sensing image salient target detection model, and outputting the corresponding saliency map.
Further, the detection feature encoder is a dense attention flow encoder, obtained by improving a VGG16 network used as the backbone network. The improvement process is as follows: the last three fully connected layers of the VGG16 network are removed, and the network is truncated before its last pooling layer, resulting in the dense attention flow encoder.
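As a minimal sketch of this encoder construction (assuming the torchvision VGG16 implementation, whose features module already excludes the three fully connected layers; the stage indices below follow torchvision's layer ordering):

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_dense_attention_flow_backbone():
    # torchvision's vgg16().features contains only the convolutional part,
    # i.e. the three fully connected layers are already removed.
    features = vgg16().features
    # Truncate before the last pooling layer (index 30 is the final MaxPool2d).
    return nn.Sequential(*list(features.children())[:-1])

def extract_stage_features(backbone, x):
    # Collect the output of the last layer of each of the five conv stages;
    # indices 3, 8, 15, 22, 29 are the last ReLU of each stage in torchvision.
    stage_ends, feats = {3, 8, 15, 22, 29}, []
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in stage_ends:
            feats.append(x)
    return feats  # five feature maps, the last two at the same resolution
```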
Further, the specific implementation manner of step S2 includes:
S21, introducing an attention mechanism: extracting the output features of the last layer of each stage of the improved VGG16 network, merging the output feature dimensions, and constructing an inter-pixel operation matrix based on a preset spatial pixel relation matrix, thereby representing the relations among pixels;
S22, performing normalization on the inter-pixel operation matrix to obtain attention weights, and multiplying the dimension-merged output features by the attention weights to obtain the spatially self-attention-weighted features;
S23, adding the output features and the spatially self-attention-weighted features through a residual connection, and passing the result through a channel attention mechanism to obtain the output deep features; the process is formulated as:
F = CA(f + δ·(f * Re^(-1)(Re(f) ⊙ R)))
where Re^(-1) denotes the inverse of the dimension-merging operation, R denotes the pixel relation matrix, * denotes element-wise multiplication, δ denotes a learnable coefficient, CA(·) denotes the channel attention mechanism, and f denotes the initial features output by the backbone network;
S24, upsampling the deep features and applying a 1×1 convolution so that their size and channel number match those of the current features;
S25, based on a preset step-by-step splicing module, splicing the upsampled and 1×1-convolved deep features with the current features, starting from the layer next to the current features and proceeding in order from shallow to deep;
S26, adjusting the channel number of the spliced features to that of the deep features output by the detection feature encoder, and feeding the result into the cascade feature decoder for decoding;
and S27, activating the final output of the cascade feature decoder by using a Sigmoid function, and further completing the training of the remote sensing image salient target detection model.
Further, the preset spatial pixel relation matrix in step S21 is formulated as:
M = {(Re(f))^T ⊙ Re(f)}^T
where Re(·) denotes the operation of merging the last two dimensions of the output features into one dimension, ⊙ denotes matrix multiplication, and T denotes transposition.
Further, the normalization of the inter-pixel operation matrix in step S22 is formulated as:
r(x, y) = e^(m(x,y)) / Σ_x e^(m(x,y))
where r(x, y) denotes the importance of the influence of pixel x on pixel y, m(x, y) denotes an element of the pixel relation matrix, and e denotes the natural constant.
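Taken together, steps S21 to S23 can be sketched in PyTorch as follows. The squeeze-and-excitation form used for the channel attention CA(·) and the axis of the softmax normalization are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextAttention(nn.Module):
    """Sketch of S21-S23: spatial self-attention built from the pixel relation
    matrix, a residual connection, then channel attention."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1))        # learnable coefficient delta
        self.ca = nn.Sequential(                         # assumed SE-style CA(.)
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, f):                                # f: (B, C, H, W)
        b, c, h, w = f.shape
        flat = f.view(b, c, h * w)                       # Re(f): merge last two dims
        m = torch.bmm(flat.transpose(1, 2), flat)        # (Re(f))^T (.) Re(f): (B, HW, HW)
        r = F.softmax(m, dim=1)                          # r(x, y): normalize over x
        weighted = torch.bmm(flat, r).view(b, c, h, w)   # Re(f) (.) R, then Re^(-1)
        out = f + self.delta * (f * weighted)            # residual: f + delta*(f * ...)
        scale = self.ca(out.mean(dim=(2, 3))).view(b, c, 1, 1)
        return out * scale                               # F = CA(.)
```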
Further, the method includes a step S23' in which information is extracted from the output features using multi-level pyramid fusion of multi-scale spatial attention, specifically: the output features are downsampled by factors of 2 and 4 to obtain three channels of different resolutions; the features at each scale are refined with the multi-scale spatial attention fused by the multi-level pyramid; the refined features are fused with the output features through a residual structure; the three levels of features are then fused in order of resolution from low to high, yielding deep features carrying the multi-level-pyramid-fused multi-scale spatial attention weights; finally, these deep features are merged with the deep features output by step S23.
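A minimal sketch of this pyramid attention under stated assumptions: the per-scale spatial attention is taken to be a 7×7 convolution producing a sigmoid gate, and average pooling is used for the 2× and 4× downsampling, neither of which is fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSpatialAttention(nn.Module):
    """Sketch of step S23': three resolutions (1x, 1/2, 1/4), per-scale
    refinement with a residual structure, fused from low to high resolution."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3) for _ in range(3))

    def forward(self, f):
        scales = [f, F.avg_pool2d(f, 2), F.avg_pool2d(f, 4)]   # 2x and 4x downsampling
        refined = [s + s * torch.sigmoid(g(s))                 # refine + residual per scale
                   for s, g in zip(scales, self.gates)]
        out = refined[2]                                       # start at lowest resolution
        for r in (refined[1], refined[0]):                     # fuse low -> high
            out = r + F.interpolate(out, size=r.shape[2:],
                                    mode="bilinear", align_corners=False)
        return out
```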
Further, in step S25 the upsampled and 1×1-convolved deep features are spliced with the current features, formulated as:
F_k = Conc(Conv(Up(F_5)), ..., Conv(Up(F_(k-1))), F_k)
where Up(·) denotes the upsampling that aligns the deep features with the current features, F_k denotes the k-th level features fed to the cascade decoder, F_5 denotes the 5th-level features fed to the cascade decoder, Conc denotes concatenation, and Conv denotes a convolutional layer.
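A sketch of this splicing step; reading the spliced deeper levels as k+1 through 5 (the deepest level down to the level just below F_k) is our interpretation of the index convention, and the 1×1 channel-adjusting convolutions are supplied by the caller:

```python
import torch
import torch.nn.functional as F

def progressive_splice(feats, k, convs1x1):
    """Sketch of S25: concatenate upsampled, 1x1-convolved deeper features
    onto the current level F_k. feats = [F1..F5] with F5 the deepest
    (1-based levels); convs1x1[j-1] adjusts the channels of level j."""
    current = feats[k - 1]
    pieces = []
    for j in range(5, k, -1):                             # F5, F4, ..., F_{k+1}
        up = F.interpolate(feats[j - 1], size=current.shape[2:],
                           mode="bilinear", align_corners=False)   # Up(.)
        pieces.append(convs1x1[j - 1](up))                # Conv(.): 1x1 convolution
    pieces.append(current)
    return torch.cat(pieces, dim=1)                       # Conc(.)
```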
Further, the preset loss function is a combined loss function with different weight coefficients, formulated as:
L = ω_1·L_P + ω_2·L_R + ω_3·L_MAE + ω_4·L_S
where L_P, L_R, L_MAE and L_S denote the precision loss term, the recall loss term, the mean absolute error loss term and the structural similarity loss term respectively, and ω_1, ω_2, ω_3 and ω_4 denote the weight coefficients of L_P, L_R, L_MAE and L_S respectively, where:
L_P = 1 - (1/N)·Σ_(n=1..N) (Σ_(i,j) S(i,j)·G(i,j) + ε) / (Σ_(i,j) S(i,j) + ε)
L_R = 1 - (1/N)·Σ_(n=1..N) (Σ_(i,j) S(i,j)·G(i,j) + ε) / (Σ_(i,j) G(i,j) + ε)
L_MAE = (1/(W×H))·Σ_(i=1..W) Σ_(j=1..H) |S(i,j) - G(i,j)|
L_S = 1 - S_measure
S_measure = α×S_o + (1-α)×S_r
where N is the total number of samples, n denotes the sample index, j denotes the pixel index along the image height, i denotes the pixel index along the image width, ε denotes a preset constant, W and H are the width and height of the remote sensing image respectively, S(i,j) ∈ S denotes the predicted value of each pixel, G(i,j) ∈ G denotes the true value of each pixel, S denotes the saliency prediction result, G denotes the ground-truth label, S_r is the region-oriented similarity measure, S_o is the object-structure-oriented similarity measure, and α denotes a hyper-parameter that balances the region-oriented and the object-structure-oriented similarity measures.
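A hedged PyTorch sketch of this combined loss follows; the precision and recall terms use the soft forms given above, and S_measure is supplied by the caller since its full object/region decomposition is lengthy:

```python
import torch

def combined_loss(s, g, s_measure_fn, weights=(1.0, 1.0, 1.0, 1.0), eps=1e-6):
    """Sketch of L = w1*L_P + w2*L_R + w3*L_MAE + w4*L_S.

    s, g: (N, 1, H, W) prediction and ground truth in [0, 1];
    s_measure_fn computes S_measure = alpha*S_o + (1 - alpha)*S_r.
    """
    tp = (s * g).sum(dim=(1, 2, 3))                                # soft true positives
    l_p = (1 - (tp + eps) / (s.sum(dim=(1, 2, 3)) + eps)).mean()   # precision loss term
    l_r = (1 - (tp + eps) / (g.sum(dim=(1, 2, 3)) + eps)).mean()   # recall loss term
    l_mae = (s - g).abs().mean()                                   # mean absolute error term
    l_s = 1 - s_measure_fn(s, g)                                   # structural similarity term
    w1, w2, w3, w4 = weights
    return w1 * l_p + w2 * l_r + w3 * l_mae + w4 * l_s
```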
Further, the method includes a step S4 of comparing the output saliency map with the ground-truth map, thereby evaluating the quality of the saliency maps generated by the remote sensing image salient target model.
Further, the specific implementation of step S4 is: evaluating the saliency maps generated by the remote sensing image salient target model with the preset indices, namely the PR curve, the F value, the mean absolute error and the S value.
Compared with the prior art, the method for detecting salient targets in remote sensing images has the following advantages:
(1) The method decodes features with a cascade structure, so that more high-level semantic features guide the feature decoding process, effectively alleviating missed and false detections of small targets in remote sensing images.
(2) The invention designs a new loss function for training the remote sensing image salient target detection model, which improves the prediction confidence of salient regions and enables the model to predict more accurate salient target boundaries.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain, rather than limit, the invention. In the drawings:
FIG. 1 is a flow chart of a method for detecting a salient object in a remote sensing image according to the present invention;
FIG. 2 is a schematic structural diagram of a method for detecting a salient object in a remote sensing image according to the present invention;
FIG. 3 is a schematic diagram of the self-attention mechanism of the present invention;
FIG. 4 is a graph of P-R curve results for an embodiment of the present invention;
FIG. 5 shows the results of improved false and missed detection of small targets, where (a) is the remote sensing image, (b) is the ground-truth map of the salient target, (c) is the saliency map generated by the original method, and (d) is the saliency map generated by the present method;
FIG. 6 shows the results of improved prediction of salient target boundary regions, where (a) is the remote sensing image, (b) is the ground-truth map of the salient target, (c) is the saliency map generated by the original method, and (d) is the saliency map generated by the present method.
Detailed Description
It should be noted that the embodiments and the features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below through embodiments with reference to the attached drawings.
Referring to fig. 1 to 6, the method for detecting a salient target in a remote sensing image of the invention comprises the following steps:
S1, acquiring remote sensing image data comprising a training set and a test set, and constructing a remote sensing image salient target detection model comprising a detection feature encoder and a cascade feature decoder;
in this step, the detection feature encoder is a dense attention flow encoder, which is obtained by improving a VGG16 network as a backbone network, and the improvement process is as follows: removing the last three fully connected layers of the VGG16 network and pooling the layers at the last level of the VGG16 networkForward truncation, and thereby obtaining the dense attention flow encoder. Therefore, the characteristic dimension of the first four layers in the improved VGG16 network is
Figure BDA0003763728030000051
W and H are width and height of the remote sensing image respectively, k is a backbone network layer, the last layer is provided with a pooling layer removed, so that the characteristic dimension of the last layer is consistent with that of the fourth layer, and after each characteristic extraction layer is finished, current-level characteristics are selected and refined and then sent to the next level for extraction.
In this embodiment, EORSSD is used as a remote sensing image data set, 1400 remote sensing images are randomly selected from the remote sensing image data set as a training set, and 600 remote sensing images are used as a test set.
S2, introducing an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, training the remote sensing image salient target detection model on the training-set remote sensing image data, and stopping training when the preset loss function converges, thereby obtaining the trained remote sensing image salient target detection model;
in this step, it is specifically:
S21, introducing an attention mechanism: extracting the output features of the last layer of each stage of the improved VGG16 network, merging the output feature dimensions, and constructing an inter-pixel operation matrix based on a preset spatial pixel relation matrix, thereby representing the relations among pixels;
wherein the preset spatial pixel relation matrix is formulated as:
M = {(Re(f))^T ⊙ Re(f)}^T
where Re(·) denotes the operation of merging the last two dimensions of the output features into one dimension, ⊙ denotes matrix multiplication, and T denotes transposition;
S22, performing normalization on the inter-pixel operation matrix to obtain attention weights, multiplying the dimension-merged output features by the attention weights, and then restoring the original dimensions of the output features to obtain the spatially self-attention-weighted features, which carry global information;
The normalization of the inter-pixel operation matrix is formulated as:
r(x, y) = e^(m(x,y)) / Σ_x e^(m(x,y))
where r(x, y) denotes the importance of the influence of pixel x on pixel y, m(x, y) denotes an element of the pixel relation matrix, and e denotes the natural constant;
S23, adding the output features and the spatially self-attention-weighted features through a residual connection, and passing the result through a channel attention mechanism to obtain the output deep features; the process is formulated as:
F = CA(f + δ·(f * Re^(-1)(Re(f) ⊙ R)))
where Re^(-1) denotes the inverse of the dimension-merging operation, R denotes the pixel relation matrix, * denotes element-wise multiplication, δ denotes a learnable coefficient, CA(·) denotes the channel attention mechanism, and f denotes the initial features output by the backbone network;
in this embodiment, after the improved VGG16 network feature extraction and feature refinement are performed on the remote sensing image, five features with different scales are finally formed, and deeper layers of the five features with different scales include more semantic features, and shallower layers retain more detailed features.
S24, upsampling the deep features and applying a 1×1 convolution so that their size and channel number match those of the current features;
S25, based on the preset step-by-step splicing module, splicing the upsampled and 1×1-convolved deep features with the current features, starting from the layer next to the current features and proceeding in order from shallow to deep; the splicing process is formulated as:
F_k = Conc(Conv(Up(F_5)), ..., Conv(Up(F_(k-1))), F_k)
where Up(·) denotes the upsampling that aligns the deep features with the current features, F_k denotes the k-th level features fed to the cascade decoder, F_5 denotes the 5th-level features fed to the cascade decoder, Conc denotes concatenation, and Conv denotes a convolutional layer;
in this embodiment, in order to realize complete extraction of image features, a multi-level feature is fused by an attention fusion method of cascading from a shallow layer to a deep layer, for example, a GCA4 module, a GCA4 module receives and splices output features of a GCA1 module, a GCA2 module, and a GCA3 module, and adjusts the number of channels to 1 again, so as to form a final attention map, which is expressed by a formula:
A 4 =Conv(Conc(A1,A2,A3,A4))
the attention map is then multiplied by the refined features and residual concatenation is used to generate the deep features that are fed into the concatenated feature decoder.
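A sketch of this cascaded attention fusion; the sigmoid gate and the way the map is applied to the refined features are assumptions:

```python
import torch
import torch.nn as nn

class CascadedAttentionFusion(nn.Module):
    """Sketch of the GCA4-style fusion: attention maps from the shallower GCA
    modules are concatenated, squeezed back to one channel, and applied to
    the refined features with a residual connection."""
    def __init__(self, n_maps=4):
        super().__init__()
        self.conv = nn.Conv2d(n_maps, 1, kernel_size=1)    # adjust channels to 1

    def forward(self, attn_maps, refined):                 # attn_maps: [A1..A4], same size
        a4 = torch.sigmoid(self.conv(torch.cat(attn_maps, dim=1)))  # A4 = Conv(Conc(...))
        return refined + refined * a4                      # multiply, then residual
```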
S26, adjusting the channel number of the spliced features to that of the deep features output by the detection feature encoder, and feeding the result into the cascade feature decoder for decoding;
and S27, activating the final output of the cascade feature decoder by using a Sigmoid function, and further completing the training of the remote sensing image salient target detection model.
In this embodiment, the deepest features carry the richest semantics and can therefore guide every level of the decoder; moreover, any feature from a deeper layer carries more semantic information than the features from shallower layers. With the cascade feature decoder, not only the deepest global features but also the deep features obtained at each encoder level guide the shallower decoders, which facilitates the generation of the final saliency map. Each decoder unit receives the output of the decoder one level above together with the deep spliced features, and the output of the last decoder is activated with a Sigmoid function to obtain the final predicted saliency map.
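One cascade decoder stage might look like the following sketch; the two-convolution block and its layer sizes are assumptions, while the final Sigmoid activation follows step S27:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeDecoderUnit(nn.Module):
    """Sketch of one decoder stage: fuses the previous stage's output with
    the spliced deep features (steps S25-S26)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, prev_out, spliced):
        prev_out = F.interpolate(prev_out, size=spliced.shape[2:],
                                 mode="bilinear", align_corners=False)
        return self.block(torch.cat([prev_out, spliced], dim=1))

# The output of the last decoder is activated with a Sigmoid (step S27):
# saliency_map = torch.sigmoid(last_decoder_output)
```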
And S3, performing salient target prediction on the test-set remote sensing image data with the trained remote sensing image salient target detection model, and outputting the corresponding saliency map.
In this embodiment, salient target prediction is performed on the test-set remote sensing image data with the trained remote sensing image salient target detection model, yielding saliency maps with more accurate salient target boundaries.
Wherein the preset loss function is a combined loss function with different weight coefficients, formulated as:
L = ω_1·L_P + ω_2·L_R + ω_3·L_MAE + ω_4·L_S
where L_P, L_R, L_MAE and L_S denote the precision loss term, the recall loss term, the mean absolute error loss term and the structural similarity loss term respectively, and ω_1, ω_2, ω_3 and ω_4 denote the weight coefficients of L_P, L_R, L_MAE and L_S respectively, where:
L_P = 1 - (1/N)·Σ_(n=1..N) (Σ_(i,j) S(i,j)·G(i,j) + ε) / (Σ_(i,j) S(i,j) + ε)
L_R = 1 - (1/N)·Σ_(n=1..N) (Σ_(i,j) S(i,j)·G(i,j) + ε) / (Σ_(i,j) G(i,j) + ε)
L_MAE = (1/(W×H))·Σ_(i=1..W) Σ_(j=1..H) |S(i,j) - G(i,j)|
L_S = 1 - S_measure
S_measure = α×S_o + (1-α)×S_r
where N is the total number of samples, n denotes the sample index, j denotes the pixel index along the image height, i denotes the pixel index along the image width, ε denotes a preset constant, W and H are the width and height of the remote sensing image respectively, S(i,j) ∈ S denotes the predicted value of each pixel, G(i,j) ∈ G denotes the true value of each pixel, S denotes the saliency prediction result, G denotes the ground-truth label, S_r is the region-oriented similarity measure, S_o is the object-structure-oriented similarity measure, and α denotes a hyper-parameter that balances the region-oriented and the object-structure-oriented similarity measures.
In this embodiment, structural similarity compares the structural information between images, and such differences better match human visual perception; therefore, using the combined loss function with different weight coefficients as the preset loss function overcomes the weak capability of the cross-entropy loss function to detect edge regions in salient object detection.
In another embodiment, the method further includes a step S4 of comparing the output saliency map with the ground-truth map, thereby evaluating the quality of the saliency maps generated by the remote sensing image salient target model, specifically: evaluating the generated saliency maps with the preset indices, namely the PR curve, the F value, the mean absolute error and the S value.
In this embodiment, the saliency map generated by the model is compared with the ground-truth map to quantitatively measure the quality of saliency map generation. Specifically, four indices are used for evaluation: the PR curve, the F value, the mean absolute error (MAE) and the S value.
Precision is the proportion of the predicted positive samples that are truly positive; Recall is the proportion of the ground-truth positive samples that are correctly predicted. By sweeping the threshold over (0, 1), all (Precision, Recall) pairs can be obtained, and connecting them in order yields the Precision-Recall (PR) curve; the closer the PR curve approaches the point (1, 1) of the coordinate axes, the better the performance of the model. FIG. 4 shows the PR curve of the method for detecting salient targets in remote sensing images in this embodiment.
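The PR curve points can be computed from a saliency map and a binary ground-truth mask as in this sketch:

```python
import numpy as np

def pr_curve_points(s, g, n_thresholds=256):
    """Sweep thresholds over (0, 1); s is a saliency map in [0, 1] and
    g a binary ground-truth mask of the same shape."""
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds, endpoint=False):
        pred = s >= t
        tp = np.logical_and(pred, g).sum()
        precisions.append(tp / max(pred.sum(), 1))   # guard against empty predictions
        recalls.append(tp / max(g.sum(), 1))
    return precisions, recalls
```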
Wherein the F value is defined as
F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)
where β² is set to 0.3 to emphasize the importance of Precision;
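In code (β² = 0.3 as above; the small epsilon guarding the division is our addition):

```python
def f_measure(precision, recall, beta2=0.3, eps=1e-8):
    # F = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```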
The mean absolute error (MAE) is an index measuring the absolute error between the saliency prediction map and the ground-truth map, formulated as:
MAE = (1/(W×H))·Σ_(i=1..W) Σ_(j=1..H) |S(i,j) - G(i,j)|
where S denotes the saliency prediction result and G denotes the ground-truth label.
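In NumPy this is a short computation (assuming s and g are arrays of the same W×H shape):

```python
import numpy as np

def mae(s, g):
    # Mean absolute error between the saliency prediction S and ground truth G.
    return np.abs(s.astype(float) - g.astype(float)).mean()
```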
The S_measure value is an index evaluating the generated saliency map in terms of structural similarity, formulated as:
S_measure = α×S_o + (1-α)×S_r
where S_r is the region-oriented similarity measure, S_o is the object-structure-oriented similarity measure, and α denotes a hyper-parameter.
In this embodiment, the method for detecting salient targets in remote sensing images is used to detect the salient targets of the test set, and the detection results are shown in Table 1, FIG. 5 and FIG. 6.
Table 1 shows the detection results for salient targets in the remote sensing images:

Evaluation index    F         MAE       S
Value               0.9031    0.0048    0.9189
As can be seen from FIG. 5 and FIG. 6, the method for detecting salient targets in remote sensing images accurately predicts both the salient targets and their boundary regions, and accurately predicts salient targets in small-target scenes, thereby reducing missed and false detections.
In another embodiment, the method further includes a step S23' in which information is extracted from the output features using multi-level pyramid fusion of multi-scale spatial attention, specifically: the output features are downsampled by factors of 2 and 4 to obtain three channels of different resolutions; the features at each scale are refined with the multi-scale spatial attention fused by the multi-level pyramid; the refined features are fused with the output features through a residual structure; the three levels of features are then fused in order of resolution from low to high, yielding deep features carrying the multi-level-pyramid-fused multi-scale spatial attention weights; finally, these deep features are merged with the deep features output by step S23.
In this embodiment, besides the attention among pixels, multi-scale attention over the whole image space can also extract useful information; specifically, after obtaining the feature output with self-attention, the GCA module also applies the multi-level pyramid fusion of multi-scale spatial attention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for detecting a salient target of a remote sensing image is characterized by comprising the following steps:
S1, acquiring remote sensing image data comprising a training set and a test set, and constructing a remote sensing image salient target detection model comprising a detection feature encoder and a cascade feature decoder;
S2, introducing an attention mechanism, a feature flow mechanism and a cascade decoding mechanism, training the remote sensing image salient target detection model on the training-set remote sensing image data, and stopping training when the preset loss function converges, thereby obtaining the trained remote sensing image salient target detection model;
and S3, performing salient target prediction on the test-set remote sensing image data with the trained remote sensing image salient target detection model, and outputting the corresponding saliency map.
2. The method for detecting the salient object in the remote sensing image according to claim 1, wherein the detection feature encoder is a dense attention flow encoder, obtained by improving a VGG16 network used as the backbone network, the improvement process being as follows: the last three fully connected layers of the VGG16 network are removed, and the network is truncated before its last pooling layer, resulting in the dense attention flow encoder.
3. The method for detecting the salient object in the remote sensing image according to claim 2, wherein the step S2 is implemented in a specific manner that includes:
S21, introducing an attention mechanism: extracting the output features of the last layer of each stage of the improved VGG16 network, merging the output feature dimensions, and constructing an inter-pixel operation matrix based on a preset spatial pixel relation matrix, thereby representing the relations among pixels;
S22, performing normalization on the inter-pixel operation matrix to obtain attention weights, and multiplying the dimension-merged output features by the attention weights to obtain the spatially self-attention-weighted features;
S23, adding the output features and the spatially self-attention-weighted features through a residual connection, and passing the result through a channel attention mechanism to obtain the output deep features; the process is formulated as:
F = CA(f + δ·(f * Re^(-1)(Re(f) ⊙ R)))
where Re^(-1) denotes the inverse of the dimension-merging operation, R denotes the pixel relation matrix, * denotes element-wise multiplication, δ denotes a learnable coefficient, CA(·) denotes the channel attention mechanism, and f denotes the initial features output by the backbone network;
S24, upsampling the deep features and applying a 1×1 convolution so that their size and channel number match those of the current features;
S25, based on a preset step-by-step splicing module, splicing the upsampled and 1×1-convolved deep features with the current features, starting from the layer next to the current features and proceeding in order from shallow to deep;
S26, adjusting the channel number of the spliced features to that of the deep features output by the detection feature encoder, and feeding the result into the cascade feature decoder for decoding;
and S27, activating the final output of the cascade feature decoder by using a Sigmoid function, and further completing the training of the remote sensing image salient target detection model.
4. The method for detecting the salient object in the remote sensing image according to claim 3, wherein the preset spatial pixel relation matrix in step S21 is formulated as:
M = {(Re(f))^T ⊙ Re(f)}^T
where Re(·) denotes the operation of merging the last two dimensions of the output features into one dimension, ⊙ denotes matrix multiplication, and T denotes transposition.
5. The method for detecting the salient object in the remote sensing image according to claim 4, wherein the normalization of the inter-pixel operation matrix in step S22 is formulated as:
r(x, y) = e^(m(x,y)) / Σ_x e^(m(x,y))
where r(x, y) denotes the importance of the influence of pixel x on pixel y, m(x, y) denotes an element of the pixel relation matrix, and e denotes the natural constant.
6. The method for detecting the salient object in the remote sensing image according to claim 5, further comprising a step S23' in which information is extracted from the output features using multi-level pyramid fusion of multi-scale spatial attention, specifically: the output features are downsampled by factors of 2 and 4 to obtain three channels of different resolutions; the features at each scale are refined with the multi-scale spatial attention fused by the multi-level pyramid; the refined features are fused with the output features through a residual structure; the three levels of features are then fused in order of resolution from low to high, yielding deep features carrying the multi-level-pyramid-fused multi-scale spatial attention weights; finally, these deep features are merged with the deep features output by step S23.
7. The method for detecting the salient object in the remote sensing image according to claim 6, wherein in step S25 the upsampled and 1×1-convolved deep features are spliced with the current features, formulated as:
F_k = Conc(Conv(Up(F_5)), ..., Conv(Up(F_(k-1))), F_k)
where Up(·) denotes the upsampling that aligns the deep features with the current features, F_k denotes the k-th level features fed to the cascade decoder, F_5 denotes the 5th-level features fed to the cascade decoder, Conc denotes concatenation, and Conv denotes a convolutional layer.
8. The method for detecting the salient object in the remote sensing image according to claim 7, wherein the preset loss function is a combined loss function with different weight coefficients, formulated as:
L = ω_1·L_P + ω_2·L_R + ω_3·L_MAE + ω_4·L_S
where L_P, L_R, L_MAE and L_S denote the precision loss term, the recall loss term, the mean absolute error loss term and the structural similarity loss term respectively, and ω_1, ω_2, ω_3 and ω_4 denote the weight coefficients of L_P, L_R, L_MAE and L_S respectively, where:
L_P = 1 - (1/N)·Σ_(n=1..N) (Σ_(i,j) S(i,j)·G(i,j) + ε) / (Σ_(i,j) S(i,j) + ε)
L_R = 1 - (1/N)·Σ_(n=1..N) (Σ_(i,j) S(i,j)·G(i,j) + ε) / (Σ_(i,j) G(i,j) + ε)
L_MAE = (1/(W×H))·Σ_(i=1..W) Σ_(j=1..H) |S(i,j) - G(i,j)|
L_S = 1 - S_measure
S_measure = α×S_o + (1-α)×S_r
where N is the total number of samples, n denotes the sample index, j denotes the pixel index along the image height, i denotes the pixel index along the image width, ε denotes a preset constant, W and H are the width and height of the remote sensing image respectively, S(i,j) ∈ S denotes the predicted value of each pixel, G(i,j) ∈ G denotes the true value of each pixel, S denotes the saliency prediction result, G denotes the ground-truth label, S_r is the region-oriented similarity measure, S_o is the object-structure-oriented similarity measure, and α denotes a hyper-parameter that balances the region-oriented and the object-structure-oriented similarity measures.
9. The method for detecting the salient target of the remote sensing image according to claim 8, further comprising a step S4 of comparing the output saliency map with the ground-truth map, thereby evaluating the quality of the saliency maps generated by the remote sensing image salient target model.
10. The method for detecting the salient object in the remote sensing image according to claim 9, wherein the specific implementation of step S4 is: evaluating the saliency maps generated by the remote sensing image salient target model with the preset indices, namely the PR curve, the F value, the mean absolute error and the S value.
CN202210879580.0A 2022-07-25 2022-07-25 Method for detecting salient target of remote sensing image Pending CN115115938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210879580.0A CN115115938A (en) 2022-07-25 2022-07-25 Method for detecting salient target of remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210879580.0A CN115115938A (en) 2022-07-25 2022-07-25 Method for detecting salient target of remote sensing image

Publications (1)

Publication Number Publication Date
CN115115938A true CN115115938A (en) 2022-09-27

Family

ID=83334609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210879580.0A Pending CN115115938A (en) 2022-07-25 2022-07-25 Method for detecting salient target of remote sensing image

Country Status (1)

Country Link
CN (1) CN115115938A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620163A (en) * 2022-10-28 2023-01-17 西南交通大学 Semi-supervised learning deep cut valley intelligent identification method based on remote sensing image


Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN107766894B (en) Remote sensing image natural language generation method based on attention mechanism and deep learning
CN110738146B (en) Target re-recognition neural network and construction method and application thereof
CN102385592B (en) Image concept detection method and device
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
CN111462124A (en) Remote sensing satellite cloud detection method based on Deep L abV3+
CN110020658B (en) Salient object detection method based on multitask deep learning
CN113673346A (en) Motor vibration data processing and state recognition method based on multi-scale SE-Resnet
CN115131580B (en) Space target small sample identification method based on attention mechanism
CN114821340A (en) Land utilization classification method and system
CN115661459A (en) 2D mean teacher model using difference information
CN115115938A (en) Method for detecting salient target of remote sensing image
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN114821299A (en) Remote sensing image change detection method
CN115994558A (en) Pre-training method, device, equipment and storage medium of medical image coding network
CN113408540A (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN110766708B (en) Image comparison method based on contour similarity
CN117152435A (en) Remote sensing semantic segmentation method based on U-Net3+
CN116959098A (en) Pedestrian re-recognition method and system based on dual-granularity tri-modal measurement learning
CN116778318A (en) Convolutional neural network remote sensing image road extraction model and method
CN111047525A (en) Method for translating SAR remote sensing image into optical remote sensing image
CN115797684A (en) Infrared small target detection method and system based on context information
CN115641498A (en) Medium-term rainfall forecast post-processing correction method based on space multi-scale convolutional neural network
CN115546638A (en) Change detection method based on Siamese cascade differential neural network
CN116486183B (en) SAR image building area classification method based on multiple attention weight fusion characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination