CN114511452A - Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention - Google Patents

Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention

Info

Publication number
CN114511452A
CN114511452A
Authority
CN
China
Prior art keywords
convolution
remote sensing
module
image
sensing image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111480268.6A
Other languages
Chinese (zh)
Other versions
CN114511452B (en)
Inventor
侯东阳 (Hou Dongyang)
王思远 (Wang Siyuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202111480268.6A priority Critical patent/CN114511452B/en
Publication of CN114511452A publication Critical patent/CN114511452A/en
Application granted granted Critical
Publication of CN114511452B publication Critical patent/CN114511452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/70: Denoising; Smoothing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10032: Satellite or aerial image; Remote sensing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention, comprising the following steps: A) constructing a reference network based on a residual structure; B) replacing the convolution modules in the residual structure with multi-scale dilated convolution modules to enhance the image features; C) embedding a triplet attention module in the residual structure formed by the multi-scale dilated convolution modules, the triplet attention module being embedded after the last convolution layer of each residual block of the residual structure; D) constructing an online label smoothing loss function, feeding the remote sensing image data into the residual structure for training, and dynamically generating a smoothing weight matrix during training; E) extracting the feature vector of the remote sensing image; F) matching the features of the remote sensing image against the features of the database images and retrieving the most similar images. The method extracts the salient semantic features of remote sensing images and effectively improves retrieval precision.

Description

Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention
Technical Field
The invention relates to an image retrieval method, in particular to a remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention.
Background
Remote sensing image retrieval is the process of querying scenes or targets of interest to a user from a remote sensing image (library) according to some similarity measure, and it is one of the key technologies for sharing and efficiently mining massive remote sensing imagery.
However, because annotating massive remote sensing images is time-consuming and labor-intensive, and annotation text often cannot accurately express image content, content-based remote sensing image retrieval ("searching images by images"), which uses image features as the basis for similarity computation, has become the mainstream approach. In recent years, deep learning methods represented by convolutional neural networks (CNNs) have learned global image features from large amounts of data and greatly improved the effect of remote sensing image retrieval.
Nevertheless, although retrieval with deep features can effectively find the required images, remote sensing images have rich targets, complex backgrounds and inconsistent scales, so the global features extracted by a CNN fail in some scenes and retrieval accuracy drops.
Disclosure of Invention
The invention aims to provide a remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention that effectively improves retrieval precision.
To solve this technical problem, the invention provides a remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention, comprising the following steps:
A) constructing a reference network based on a residual structure;
B) replacing the convolution modules in the residual structure with multi-scale dilated convolution modules;
C) embedding a triplet attention module in the residual structure formed by the multi-scale dilated convolution modules, wherein the triplet attention module is embedded after the last convolution layer of each residual block of the residual structure;
D) constructing an online label smoothing loss function, feeding remote sensing image data into the residual structure for training, and dynamically generating a smoothing weight matrix during training;
E) extracting the feature vector of the remote sensing image;
F) matching the features of the remote sensing image against the features of the database images and retrieving the most similar image.
Preferably, in step B), the convolution modules in the residual structure are replaced with multi-scale dilated convolution modules as follows:
B1) setting each 3 × 3 convolution module in the residual structure as a dilated convolution module;
B2) setting the dilation rates of the dilated convolution modules to [1, 2, 5, 9] respectively, forming the multi-scale dilated convolution module; a minimal sketch of such a module is given below.
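The following PyTorch sketch illustrates steps B1) and B2). The text fixes only the 3 × 3 kernel and the dilation rates [1, 2, 5, 9]; fusing the four branches by elementwise summation, and writing the module as a drop-in replacement for a 3 × 3 convolution, are assumptions of this sketch.

```python
import torch.nn as nn

class MultiScaleDilatedConv(nn.Module):
    """Parallel 3x3 convolutions at dilation rates [1, 2, 5, 9], fused by
    elementwise summation (assumed): a drop-in replacement for a 3x3 conv."""
    def __init__(self, in_channels, out_channels, stride=1,
                 dilations=(1, 2, 5, 9)):
        super().__init__()
        # padding = dilation keeps every branch at the same spatial size,
        # so the multi-scale responses can be summed elementwise.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        ])

    def forward(self, x):
        # Sum of the four multi-scale responses.
        return sum(branch(x) for branch in self.branches)
```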
Further preferably, in step C), the triplet attention module models channel attention and spatial attention separately through cross-dimension interaction between the channel dimension and the spatial dimensions.
Preferably, the interaction steps of the triplet attention module are as follows:
C1) given an input feature map X ∈ R^{H×W×C}, i.e., of size H × W × C;
C2) computing the information of the three branches of the triplet attention module separately;
C3) aggregating the features extracted by each branch through averaging to produce the output.
Further preferably, the first branch of the triplet attention module is the spatial attention calculation branch, in which the input features undergo channel pooling and dilated convolution, after which a Sigmoid activation function generates the spatial attention weights.
Preferably, the second branch of the triplet attention module captures the interaction between the channel dimension C and the spatial dimension W: the input feature X is first transposed into an H × C × W feature, pooled along the H dimension, and finally transformed back into a C × H × W feature through convolution and a Sigmoid activation function.
Further preferably, the third branch of the triplet attention module captures the interaction between the channel dimension C and the spatial dimension H: the input feature X is first transposed into a W × H × C feature, pooled along the W dimension, and finally transformed back into a C × H × W feature through convolution and a Sigmoid activation function.
Preferably, in step D), the smoothing weight matrix imposes a differentiated distance constraint on images of different categories. The specific formulas of the smoothing weight matrix are

$$L_{hard} = -\sum_{k=1}^{K} q(k \mid x_i)\,\log p(k \mid x_i)$$

$$L_{soft} = -\sum_{k=1}^{K} S^{t}_{y_i,k}\,\log p(k \mid x_i)$$

$$q(k = y_i \mid x_i) = 1, \qquad q(k \neq y_i \mid x_i) = 0$$

where $L_{hard}$ is the cross-entropy loss, $x_i$ denotes an input image, $y_i$ the true category of the input image, $k$ the predicted category, $K$ the total number of image categories, $p(k \mid x_i)$ the probability that input image $x_i$ is predicted as class $k$, $q$ the distribution of $y_i$, $L_{soft}$ the online label smoothing loss, $t$ the number of training iterations, and $S^{t}$ the label smoothing threshold, which is adjusted continuously by iteration during training.
Further preferably, in step D), the model loss and the normalized threshold used when training with the online label smoothing loss function are computed as follows: the model loss is calculated with the current smoothing threshold $S^{t}$; the threshold is then updated with the prediction probabilities of the reference network model,

$$\hat{S}^{t+1}_{y_i,k} = \hat{S}^{t+1}_{y_i,k} + p(k \mid x_i),$$

and $\hat{S}^{t+1}$ is normalized to obtain the smoothing threshold at training iteration $t+1$:

$$S^{t+1}_{y_i,k} = \frac{\hat{S}^{t+1}_{y_i,k}}{\sum_{k'=1}^{K} \hat{S}^{t+1}_{y_i,k'}}$$
Preferably, the reference network model is trained jointly with the cross-entropy loss function and the online label smoothing loss function, the total training loss being

$$L = \alpha L_{hard} + (1 - \alpha)\, L_{soft}$$

where $L$ is the total training loss and $\alpha$ is a balance coefficient that weights the cross-entropy loss against the online label smoothing loss. A sketch of this training scheme is given below.
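The following sketch assembles the pieces above: a hard cross-entropy term, a soft term computed from the smoothing threshold $S^{t}$, accumulation of the prediction probabilities of correctly classified samples into $\hat{S}^{t+1}$, and per-epoch normalization. Storing $S$ as a K × K matrix whose row y holds the smoothed target distribution for class y, the uniform initialization, and the class and method names are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

class OnlineLabelSmoothing:
    """Sketch of the online label smoothing loss L = a*L_hard + (1-a)*L_soft."""
    def __init__(self, num_classes, alpha=0.5, device="cpu"):
        self.alpha = alpha                         # balance coefficient
        # S^t: row y is the smoothed target distribution for class y
        # (initialized uniformly; an assumption of this sketch).
        self.S = torch.full((num_classes, num_classes),
                            1.0 / num_classes, device=device)
        self.S_accum = torch.zeros_like(self.S)    # accumulator for S^{t+1}

    def loss(self, logits, target):
        log_p = F.log_softmax(logits, dim=1)
        l_hard = F.nll_loss(log_p, target)                    # cross entropy
        l_soft = -(self.S[target] * log_p).sum(dim=1).mean()  # smoothed term
        with torch.no_grad():
            # Accumulate predictions of correctly classified samples.
            p = log_p.exp()
            correct = p.argmax(dim=1) == target
            self.S_accum.index_add_(0, target[correct], p[correct])
        return self.alpha * l_hard + (1 - self.alpha) * l_soft

    def end_epoch(self):
        # Normalize the accumulator row-wise to obtain S^{t+1};
        # rows of classes never predicted correctly stay zero in this sketch.
        row_sum = self.S_accum.sum(dim=1, keepdim=True).clamp_min(1e-12)
        self.S = self.S_accum / row_sum
        self.S_accum = torch.zeros_like(self.S)
```

In use, `loss(...)` is called per batch and `end_epoch()` once per epoch, so the smoothing weight matrix is regenerated dynamically as training proceeds.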
According to the above technical scheme, the remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention extracts the features of ground objects at different scales with the multi-scale dilated convolution module, and adds a triplet attention module to the residual feature structure model to enhance the features of the remote sensing image; the triplet attention module working together with the multi-scale dilated convolution module ensures the accuracy of the extracted image features. In view of the complexity of remote sensing images, an online label smoothing loss constrains images of different categories during training, so that the retrieved images are more accurate.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
FIG. 1 is a flow chart of the remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention according to the invention;
FIG. 2 is an overall schematic diagram of the method;
FIG. 3 is a schematic diagram of the first residual structure and the second residual structure;
FIG. 4 is a schematic diagram of the third residual structure and the fourth residual structure;
FIG. 5 compares the visualized features of an airplane remote sensing image produced by the invention and by the conventional method;
FIG. 6 compares the visualized features of a port image produced by the invention and by the conventional method;
FIG. 7 compares the visualized features of a golf course image produced by the invention and by the conventional method;
FIG. 8 compares the visualized features of a parking lot image produced by the invention and by the conventional method;
FIG. 9 compares the visualized features of a reservoir image produced by the invention and by the conventional method;
FIG. 10 compares the visualized features of similar images produced by the invention and by the conventional method.
Reference numerals
1  remote sensing image
2  first convolution layer
3  first residual structure
4  second residual structure
5  third residual structure
6  fourth residual structure
7  fully connected layer
8  online label smoothing
Detailed Description
The following describes embodiments of the invention in detail with reference to the accompanying drawings. It should be understood that the detailed description and specific examples illustrate and explain the invention but do not limit it.
As shown in fig. 1 to 4, an embodiment of the remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention provided by the present invention comprises the following steps:
A) constructing a reference network based on a residual structure;
B) replacing the convolution modules in the residual structure with multi-scale dilated convolution modules;
C) embedding a triplet attention module in the residual structure formed by the multi-scale dilated convolution modules, wherein the triplet attention module is embedded after the last convolution layer of each residual block of the residual structure;
D) constructing an online label smoothing 8 loss function, feeding the data of the remote sensing image 1 into the residual network for training, and dynamically generating a smoothing weight matrix during training;
E) extracting the feature vector of the remote sensing image 1;
F) matching the features of the remote sensing image 1 against the features of the database images and retrieving the most similar image.
As shown in fig. 2, on the residual feature structure based on the ResNet50 reference network, a reference network model that integrates the multi-scale dilated convolution module and the triplet attention module effectively improves the accuracy of remote sensing image retrieval. In the adopted reference network model, a captured remote sensing image 1 is fed into the first convolution layer 2 as input. After the first convolution layer 2, repeated convolutions form the first residual structure 3 and the second residual structure 4; the convolution modules in the first residual structure 3 and the second residual structure 4 are then replaced with multi-scale dilated convolution modules, so that features are extracted under different receptive fields with multi-scale dilated convolution. A parameter-free triplet attention module is embedded after the last convolution layer of each residual structure, forming the third residual structure 5 and the fourth residual structure 6; through cross-dimension interaction between space and channels, an attention weight matrix is learned adaptively so that the network focuses on the important features of the image. The features extracted through the residual structures are classified by the fully connected layer 7, and the network is trained end to end with the online label smoothing 8 loss function to reduce intra-class differences and enhance inter-class separability. Finally, verification on public remote sensing image 1 data sets shows that the retrieval accuracy of the remote sensing image 1 is effectively improved. A hedged sketch of this assembly is given below.
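The sketch below assembles such a network under the assumption that the reference network is torchvision's ResNet-50: the 3 × 3 convolution of every bottleneck is replaced by the MultiScaleDilatedConv module sketched earlier, and a TripletAttention module (sketched later in this description) is chained after the last convolution of each block. Modifying all four stages and hooking the attention onto the final BatchNorm are assumptions; the text does not fix these details.

```python
import torch.nn as nn
import torchvision

def build_reference_network(num_classes: int) -> nn.Module:
    model = torchvision.models.resnet50(weights=None)
    for layer in (model.layer1, model.layer2, model.layer3, model.layer4):
        for block in layer:                  # torchvision Bottleneck blocks
            old = block.conv2                # the block's 3x3 convolution
            block.conv2 = MultiScaleDilatedConv(old.in_channels,
                                                old.out_channels,
                                                stride=old.stride[0])
            # Embed triplet attention after the block's last convolution
            # (conv3) by chaining it onto the following BatchNorm, so it
            # runs before the residual addition.
            block.bn3 = nn.Sequential(block.bn3, TripletAttention())
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # classifier head
    return model
```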
Specifically, compared with natural images, the remote sensing image 1 has a more complex background, which tends to produce larger intra-class differences, while images of different classes often show high similarity. The trained depth features therefore have large intra-class distances and unclear inter-class boundaries, so inter-class separability and intra-class compactness must be increased during training to group similar images into more compact clusters. The dynamically generated smoothing weight matrix imposes a differentiated distance constraint on images of different classes, shrinking intra-class distances and enlarging inter-class differences. The specific formulas of the smoothing weight matrix are

$$L_{hard} = -\sum_{k=1}^{K} q(k \mid x_i)\,\log p(k \mid x_i)$$

$$L_{soft} = -\sum_{k=1}^{K} S^{t}_{y_i,k}\,\log p(k \mid x_i)$$

$$q(k = y_i \mid x_i) = 1, \qquad q(k \neq y_i \mid x_i) = 0$$

where $L_{hard}$ is the cross-entropy loss, $x_i$ denotes an input image, $y_i$ the true class of the input image, $k$ the predicted class, $K$ the total number of image classes, $p(k \mid x_i)$ the probability that input image $x_i$ is predicted as class $k$, $q$ the distribution of $y_i$, $L_{soft}$ the online label smoothing 8 loss, $t$ the number of training iterations, and $S^{t}$ the label smoothing threshold, which is adjusted continuously by iteration during training.
Specifically, the model loss and the normalized threshold used in the training method of the online label smoothing loss function are computed as follows: the model loss is calculated with the current smoothing threshold $S^{t}$; the threshold is then updated with the prediction probabilities of the reference network model,

$$\hat{S}^{t+1}_{y_i,k} = \hat{S}^{t+1}_{y_i,k} + p(k \mid x_i),$$

and $\hat{S}^{t+1}$ is normalized to obtain the smoothing threshold at training iteration $t+1$:

$$S^{t+1}_{y_i,k} = \frac{\hat{S}^{t+1}_{y_i,k}}{\sum_{k'=1}^{K} \hat{S}^{t+1}_{y_i,k'}}$$
Then the reference network model is trained jointly with the cross-entropy loss function and the online label smoothing 8 loss function, the total training loss being

$$L = \alpha L_{hard} + (1 - \alpha)\, L_{soft}$$

where $L$ is the total training loss and $\alpha$ is a balance coefficient that weights the cross-entropy loss against the online label smoothing 8 loss.
In an embodiment of the remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention, step B) specifically comprises embedding multi-scale dilated convolutions with dilation rates [1, 2, 5, 9] into the residual structure.
Specifically, dilated convolution obtains a larger receptive field without introducing additional parameters and can simultaneously capture multi-scale context information; it has been applied to image segmentation and target detection. To capture the features of the remote sensing image 1 at different scales, a multi-scale dilated convolution module is designed in the reference network model, realizing feature extraction of remote sensing image 1 information at different scales.
In particular, a larger range of feature information is captured without introducing extra parameters. The dilation rate of a dilated convolution defines the spacing between the positions at which the convolution kernel samples the data. For a convolution kernel of size k × k and dilation rate r, the effective kernel size k_d × k_d is obtained from equation (1):

k_d = k + (k - 1) · (r - 1).    (1)
While dilated convolution enlarges the information receptive field, the convolved positions are spatially discontinuous, which makes distant information unrelated and, for a remote sensing image 1 with a complex background, can lose the information of small targets; the multi-scale dilated convolution module adopted here preserves the continuity of image information. The dilation rates of stacked dilated convolutions should not share a common divisor greater than 1, and their distribution follows a sawtooth (zigzag) heuristic structure. For example, for a kernel with k = 3, an ascending group of dilation rates [1, 2, 5, 9] is set to adaptively extract ground-object information of different sizes: convolutions with smaller dilation rates capture short-range ground-object information, while convolutions with larger dilation rates capture long-range information, so information is gathered from a wider area without destroying the continuity of the convolved region. The effective kernel sizes of this group are checked below.
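A short check of equation (1) for the ascending dilation-rate group used here:

```python
# Effective kernel size k_d = k + (k - 1) * (r - 1), equation (1),
# for the 3 x 3 kernels at dilation rates [1, 2, 5, 9].
def effective_kernel_size(k: int, r: int) -> int:
    return k + (k - 1) * (r - 1)

for r in (1, 2, 5, 9):
    # r=1 -> 3, r=2 -> 5, r=5 -> 11, r=9 -> 19
    print(r, effective_kernel_size(3, r))
```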
In an embodiment of the remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention, the interaction steps of the triplet attention module are as follows:
C1) given an input feature map X ∈ R^{H×W×C}, i.e., of size H × W × C;
C2) computing the information of the three branches of the triplet attention module separately;
C3) aggregating the features extracted by each branch through averaging to produce the output.
Specifically, the visual attention mechanism rapidly scans the global image to obtain the target areas that deserve attention, devotes more attention resources to those areas to obtain more detailed information about the targets of interest, and suppresses other useless information. When applied to the remote sensing image 1, which contains a large amount of background information that strongly affects depth feature discrimination, a nearly parameter-free triplet attention module is embedded into the residual feature structure model: two of its branches capture the cross-dimension interactions between the channel dimension and the spatial dimensions, while the third performs the spatial attention weight calculation, so channel attention and spatial attention are modeled separately. The first branch is the spatial attention calculation branch: the input features undergo channel pooling and a 7 × 7 convolution, after which a Sigmoid activation function generates the spatial attention weights. The second branch captures the interaction between channel C and spatial dimension W: the input feature X is first transposed into an H × C × W feature, pooled along the H dimension, and finally transformed back into a C × H × W feature through a 7 × 7 convolution and a Sigmoid activation function. The third branch captures the interaction between channel C and spatial dimension H: the input feature X is first transposed into a W × H × C feature, pooled along the W dimension, and finally transformed back into a C × H × W feature through convolution and a Sigmoid activation function. Finally, the information extracted by the branches is aggregated by averaging to produce the output. A minimal sketch of this module follows.
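The following PyTorch sketch reflects this three-branch structure: channel pooling stacks the max and mean maps, a 7 × 7 convolution and a Sigmoid produce the attention gate, and the two rotated branches swap H (resp. W) into the channel position before gating. The BatchNorm inside the gate is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Channel pooling (max + mean), 7x7 conv, Sigmoid: one attention branch."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)   # an assumption of this sketch

    def forward(self, x):
        # Pool over dim 1 (whichever dimension currently sits in the
        # channel position) and stack the max and mean maps.
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.bn(self.conv(pooled)))

class TripletAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate_hw = AttentionGate()   # spatial attention branch
        self.gate_cw = AttentionGate()   # C-W interaction (H pooled)
        self.gate_ch = AttentionGate()   # C-H interaction (W pooled)

    def forward(self, x):                # x: (B, C, H, W)
        out_hw = self.gate_hw(x)
        # Transpose to (B, H, C, W), gate, transpose back.
        out_cw = self.gate_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Transpose to (B, W, H, C), gate, transpose back.
        out_ch = self.gate_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return (out_hw + out_cw + out_ch) / 3.0
```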
To verify the accuracy of the retrieval method, experiments were run on an Ubuntu 20 system equipped with an Intel 3.7 GHz i9-10900K processor and an NVIDIA GeForce GTX3090 graphics card. In the training phase, training runs for 40 epochs with the Adam optimizer, an initial learning rate of 3e-4 and a weight decay of 3e-4. In all experiments, the input images were resized to 224 × 224 pixels. For comparison, four public remote sensing image 1 data sets were used as verification data sets:
1) UCMD: the UCMD dataset contains 2100 remote sensing images 1 from the US Geological Survey (USGS), covering 21 categories (airplanes, buildings, rivers, etc.), each category containing 100 images of 256 × 256 pixels.
2) NWPU: the NWPU dataset contains 45 classes of images, each class containing 700 images, for a total of 31500 images of 256 × 256 pixels.
3) PatternNet: the PatternNet dataset consists of 38 classes, each class containing 800 images of 256 × 256 pixels collected from Google Earth. The ground resolution of the images is 0.6-4.7 meters.
4) VArcGIS: the VArcGIS large-scale remote sensing dataset consists of 38 classes of images collected from ArcGIS World Imagery, each class containing 1504- images.
For each reference dataset, each class of images was randomly split into a training set and a test set at an 8:2 ratio; the training set was further split, with 80% of the images used for training and the remaining 20% for validation. During testing, the fully connected layer 7 is removed from the model and its input serves as the image feature, with the Euclidean distance used to measure feature similarity: the closer the visual features of the query image are to those of other images, the more similar those images are. For comparative evaluation, results are reported with three standard retrieval metrics: Average Normalized Modified Retrieval Rank (ANMRR), mean average precision (mAP) and precision at k (P@k), with k set to 5, 10, 20, 50, 100 and 1000; lower ANMRR and higher mAP and P@k values indicate better retrieval precision. A sketch of this matching and evaluation step is given below.
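The sketch below illustrates this step: database descriptors are ranked by Euclidean distance to the query descriptor, and P@k is computed against the class labels. Function and variable names are illustrative.

```python
import numpy as np

def rank_by_euclidean(query_feat: np.ndarray, db_feats: np.ndarray) -> np.ndarray:
    """Return database indices sorted from nearest to farthest.
    query_feat: (D,) descriptor; db_feats: (N, D) database descriptors."""
    dists = np.linalg.norm(db_feats - query_feat[None, :], axis=1)
    return np.argsort(dists)

def precision_at_k(ranked_ids, query_label, db_labels, k):
    """Fraction of the top-k retrieved images sharing the query's class."""
    top_k = ranked_ids[:k]
    return float(np.mean(db_labels[top_k] == query_label))

# Usage with the k values from the experiments:
# order = rank_by_euclidean(q_feat, feats)
# for k in (5, 10, 20, 50, 100, 1000):
#     print(k, precision_at_k(order, q_label, labels, k))
```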
Experiments on the four data sets produce the results shown in Tables 1 and 2.
Table 1: Retrieval accuracy on the four reference datasets (table available only as an image in the original document)
Table 2: Retrieval accuracy of different methods on the UCMD dataset (table available only as an image in the original document)
In Table 1, larger mAP and P@k values and smaller ANMRR values are better. As Table 1 shows, the average retrieval precision on the PatternNet and VArcGIS data sets, whose targets are clear, improves by 6.17% and 9.67% respectively, and on the UCMD and NWPU data sets, whose backgrounds are complex, by 24.46% and 33.84% respectively. Compared with other algorithms, the method of the present invention obtains the smallest ANMRR value and the largest mAP value on the complex-background UCMD dataset, i.e., the highest retrieval precision. These comparisons clearly show that images with complex backgrounds place higher demands on feature extraction capability; by extracting multi-scale features and key-region features from the remote sensing image 1, the reference network model achieves a large performance improvement on data sets with rich scenes and complex backgrounds.
In addition, to test the effectiveness of the multi-scale feature extraction module and the attention module, the Grad-CAM++ tool is used to visually compare the feature heat maps output by the models and thus their image characterization capabilities, as shown in FIGS. 5 to 10; the redder the color, the more sensitive the model is to the pixel values at that position, i.e., the higher the attention. Comparing the reference method with the remote sensing image 1 detection method adopted by the present invention, the heat maps of the reference method are generally mislocated. In fig. 5, for example, fig. 5(a) is the captured remote sensing image 1; in fig. 5(b), produced by the conventional reference method, the spatial position of the feature heat map deviates, its focus falling on the blank area to the lower right of the airplane, whereas in fig. 5(c), produced by the present method, the feature heat map sits exactly on the airplane without deviation. In fig. 6, fig. 6(a) is the base remote sensing image 1; in fig. 6(b) the reference method locates the estuaries with the feature heat map clearly offset, lying between the two estuaries, while in fig. 6(c) the present method locates the feature heat map accurately on both estuaries without any deviation. In figs. 7 to 10, the spatial localization of the feature heat maps by the reference method in figs. 7(b) to 10(b) deviates to varying degrees; this is particularly obvious in fig. 10, where the reference method in fig. 10(b) places the feature heat map wrongly, covering an irrelevant area, while in fig. 10(c) the present method captures the target object features accurately. These comparisons show that the reference model is weak at capturing the salient features of an image. By contrast, the remote sensing image 1 retrieval method of the invention captures the target object accurately: the resulting feature heat map covers the target object, with a more reasonable coverage position and higher fineness. In the parking lot image of fig. 8, for example, the feature heat map generated by this method covers an accurate range, its focus falling on the ground-object targets at a finer level of detail. These heat-map comparisons demonstrate that the remote sensing image 1 retrieval method has a stronger image feature extraction capability, better captures the multi-scale and discriminative features of the remote sensing image 1, and effectively improves retrieval precision.
In the description of the present invention, reference to "one embodiment", "some embodiments", "an implementation", etc., means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such schematic expressions do not necessarily refer to the same embodiment or example, and the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention have been described in detail above with reference to the accompanying drawings, but the invention is not limited thereto. Within the scope of the technical idea of the invention, numerous simple modifications can be made to the technical solution, including combining the individual specific technical features in any suitable way; to avoid unnecessary repetition, these possible combinations are not described separately. Such simple modifications and combinations should likewise be considered within the scope of the invention.

Claims (10)

1. A remote sensing image retrieval method fusing multi-scale dilated convolution and triplet attention, characterized by comprising the following steps:
A) constructing a reference network based on a residual structure;
B) replacing the convolution modules in the residual structure with multi-scale dilated convolution modules;
C) embedding a triplet attention module in the residual structure formed by the multi-scale dilated convolution modules, wherein the triplet attention module is embedded after the last convolution layer of each residual block of the residual structure;
D) constructing an online label smoothing loss function, feeding remote sensing image data into the residual structure for training, and dynamically generating a smoothing weight matrix during training;
E) extracting the feature vector of the remote sensing image; and
F) matching the features of the remote sensing image against the features of the database images and retrieving the most similar image.
2. The method according to claim 1, wherein in step B) the convolution modules in the residual structure are replaced with multi-scale dilated convolution modules as follows:
B1) setting each 3 × 3 convolution module in the residual structure as a dilated convolution module; and
B2) setting the dilation rates of the dilated convolution modules to [1, 2, 5, 9] respectively, forming the multi-scale dilated convolution module.
3. The method of claim 1, wherein in step C) the triplet attention module models channel attention and spatial attention separately through cross-dimension interaction between the channel dimension and the spatial dimensions.
4. The method of claim 3, wherein the interaction steps of the triplet attention module are as follows:
C1) given an input feature map X ∈ R^{H×W×C}, i.e., of size H × W × C;
C2) computing the information of the three branches of the triplet attention module separately; and
C3) aggregating the features extracted by each branch through averaging to produce the output.
5. The method of claim 3, wherein the first branch of the triplet attention module is the spatial attention calculation branch, in which the input features undergo channel pooling and dilated convolution, after which a Sigmoid activation function generates the spatial attention weights.
6. The method of claim 3, wherein the second branch of the triplet attention module captures the interaction between the channel dimension C and the spatial dimension W: the input feature X is first transposed into an H × C × W feature, pooled along the H dimension, and finally transformed back into a C × H × W feature through convolution and a Sigmoid activation function.
7. The method of claim 3, wherein the third branch of the triplet attention module captures the interaction between the channel dimension C and the spatial dimension H: the input feature X is first transposed into a W × H × C feature, pooled along the W dimension, and finally transformed back into a C × H × W feature through convolution and a Sigmoid activation function.
8. The method according to claim 1, wherein in step D) the smoothing weight matrix imposes a differentiated distance constraint on images of different categories, the specific formulas of the smoothing weight matrix being

$$L_{hard} = -\sum_{k=1}^{K} q(k \mid x_i)\,\log p(k \mid x_i)$$

$$L_{soft} = -\sum_{k=1}^{K} S^{t}_{y_i,k}\,\log p(k \mid x_i)$$

$$q(k = y_i \mid x_i) = 1, \qquad q(k \neq y_i \mid x_i) = 0$$

where $L_{hard}$ is the cross-entropy loss, $x_i$ denotes an input image, $y_i$ the true category of the input image, $k$ the predicted category, $K$ the total number of image categories, $p(k \mid x_i)$ the probability that input image $x_i$ is predicted as class $k$, $q$ the distribution of $y_i$, $L_{soft}$ the online label smoothing loss, $t$ the number of training iterations, and $S^{t}$ the label smoothing threshold, which is adjusted continuously by iteration during training.
9. The method according to claim 8, wherein in step D) the model loss and the normalized threshold used in the training method of the online label smoothing loss function are computed as follows: after calculating the model loss, the threshold is updated with the prediction probabilities of the reference network model,

$$\hat{S}^{t+1}_{y_i,k} = \hat{S}^{t+1}_{y_i,k} + p(k \mid x_i),$$

and $\hat{S}^{t+1}$ is normalized to obtain the smoothing threshold at training iteration $t+1$:

$$S^{t+1}_{y_i,k} = \frac{\hat{S}^{t+1}_{y_i,k}}{\sum_{k'=1}^{K} \hat{S}^{t+1}_{y_i,k'}}$$
10. The method of claim 8, wherein the reference network model is trained jointly with a cross-entropy loss function and the online label smoothing loss function, the total training loss being

$$L = \alpha L_{hard} + (1 - \alpha)\, L_{soft}$$

where $L$ is the total training loss and $\alpha$ is a balance coefficient that weights the cross-entropy loss against the online label smoothing loss.
CN202111480268.6A 2021-12-06 2021-12-06 Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention Active CN114511452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480268.6A CN114511452B (en) 2021-12-06 2021-12-06 Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480268.6A CN114511452B (en) 2021-12-06 2021-12-06 Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention

Publications (2)

Publication Number Publication Date
CN114511452A true CN114511452A (en) 2022-05-17
CN114511452B CN114511452B (en) 2024-03-19

Family

ID=81548234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480268.6A Active CN114511452B (en) Remote sensing image retrieval method integrating multi-scale dilated convolution and triplet attention

Country Status (1)

Country Link
CN (1) CN114511452B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309927A (en) * 2022-10-09 2022-11-08 中国海洋大学 Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN115618098A (en) * 2022-09-08 2023-01-17 淮阴工学院 Cold-chain logistics recommendation method and device based on knowledge enhancement and hole convolution
CN117073848A (en) * 2023-10-13 2023-11-17 中国移动紫金(江苏)创新研究院有限公司 Temperature measurement method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110578A (en) * 2019-02-21 2019-08-09 北京工业大学 A kind of indoor scene semanteme marking method
WO2019210737A1 (en) * 2018-05-04 2019-11-07 上海商汤智能科技有限公司 Object prediction method and apparatus, electronic device and storage medium
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method
CN111079649A (en) * 2019-12-17 2020-04-28 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
WO2020215984A1 (en) * 2019-04-22 2020-10-29 腾讯科技(深圳)有限公司 Medical image detection method based on deep learning, and related device
CN112101190A (en) * 2020-09-11 2020-12-18 西安电子科技大学 Remote sensing image classification method, storage medium and computing device
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
CN112669323A (en) * 2020-12-29 2021-04-16 深圳云天励飞技术股份有限公司 Image processing method and related equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019210737A1 (en) * 2018-05-04 2019-11-07 上海商汤智能科技有限公司 Object prediction method and apparatus, electronic device and storage medium
CN110110578A (en) * 2019-02-21 2019-08-09 北京工业大学 A kind of indoor scene semanteme marking method
WO2020215984A1 (en) * 2019-04-22 2020-10-29 腾讯科技(深圳)有限公司 Medical image detection method based on deep learning, and related device
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method
CN111079649A (en) * 2019-12-17 2020-04-28 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN112101190A (en) * 2020-09-11 2020-12-18 西安电子科技大学 Remote sensing image classification method, storage medium and computing device
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
CN112669323A (en) * 2020-12-29 2021-04-16 深圳云天励飞技术股份有限公司 Image processing method and related equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DONGYANG HOU et al.: "An Attention-Enhanced End-to-End Discriminative Network With Multiscale Feature Learning for Remote Sensing Image Retrieval", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pages 8245 *
SI-BAO CHEN et al.: "Remote Sensing Scene Classification via Multi-Branch Local Attention Network", IEEE Transactions on Image Processing, vol. 31, pages 99-109, XP011890281, DOI: 10.1109/TIP.2021.3127851 *
PENG JINCHAO et al.: "MS-VSCN: a multi-scale visual similarity comparison network for image matching", Journal of Geomatics Science and Technology, vol. 38, no. 1, pages 56-63 *
XU SHENGJUN et al.: "Building segmentation of remote sensing images using multi-scale feature fusion dilated convolution ResNet", Optics and Precision Engineering, no. 07, pages 179-190 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618098A (en) * 2022-09-08 2023-01-17 淮阴工学院 Cold-chain logistics recommendation method and device based on knowledge enhancement and hole convolution
CN115309927A (en) * 2022-10-09 2022-11-08 中国海洋大学 Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN115309927B (en) * 2022-10-09 2023-02-03 中国海洋大学 Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN117073848A (en) * 2023-10-13 2023-11-17 中国移动紫金(江苏)创新研究院有限公司 Temperature measurement method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114511452B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN109948425B (en) Pedestrian searching method and device for structure-aware self-attention and online instance aggregation matching
Bai et al. Learning-based efficient graph similarity computation via multi-scale convolutional set matching
CN114511452B (en) Remote sensing image retrieval method integrating multi-scale cavity convolution and triplet attention
CN109670528B (en) Data expansion method facing pedestrian re-identification task and based on paired sample random occlusion strategy
CN110929607B (en) Remote sensing identification method and system for urban building construction progress
CN106529499A (en) Fourier descriptor and gait energy image fusion feature-based gait identification method
CN110334578B (en) Weak supervision method for automatically extracting high-resolution remote sensing image buildings through image level annotation
CN109063649B (en) Pedestrian re-identification method based on twin pedestrian alignment residual error network
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN114758362B (en) Clothing changing pedestrian re-identification method based on semantic perception attention and visual shielding
CN111680678B (en) Target area identification method, device, equipment and readable storage medium
CN106021603A (en) Garment image retrieval method based on segmentation and feature matching
Pang et al. Deep feature aggregation and image re-ranking with heat diffusion for image retrieval
CN104850822B (en) Leaf identification method under simple background based on multi-feature fusion
CN112949740B (en) Small sample image classification method based on multilevel measurement
CN112699834B (en) Traffic identification detection method, device, computer equipment and storage medium
Tzeng et al. User-driven geolocation of untagged desert imagery using digital elevation models
CN113032613B (en) Three-dimensional model retrieval method based on interactive attention convolution neural network
CN112132014A (en) Target re-identification method and system based on non-supervised pyramid similarity learning
Lei et al. Boundary extraction constrained siamese network for remote sensing image change detection
CN105654122A (en) Spatial pyramid object identification method based on kernel function matching
CN110348287A (en) A kind of unsupervised feature selection approach and device based on dictionary and sample similar diagram
Shao et al. Land use classification using high-resolution remote sensing images based on structural topic model
Zhang et al. Semisupervised center loss for remote sensing image scene classification
CN105447869A (en) Particle swarm optimization algorithm based camera self-calibration method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant