CN113343953B - FGR-AM method and system for remote sensing scene recognition - Google Patents


Info

Publication number
CN113343953B
Authority
CN
China
Prior art keywords: module, remote sensing, convolution, image, features
Legal status: Active
Application number
CN202110894846.4A
Other languages: Chinese (zh)
Other versions: CN113343953A
Inventor
夏景明
丁悦
谈玲
Current Assignee
Nanjing Zhiqiang Information Technology Co.,Ltd.
Original Assignee
Nanjing University of Information Science and Technology
Priority date: 2021-08-05
Filing date: 2021-08-05
Publication date: 2021-12-21
Application filed by Nanjing University of Information Science and Technology
Priority to CN202110894846.4A
Publication of CN113343953A (application)
Application granted; publication of CN113343953B (grant)


Classifications

    • G06F18/241: Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Computing arrangements based on biological models; neural networks; architectures; combinations of networks
    • G06N3/048: Computing arrangements based on biological models; neural networks; architectures; activation functions
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention discloses an FGR-AM method for remote sensing scene recognition, which comprises the following steps: performing effective information enhancement processing and ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules; extracting, from the image features of the 3rd bottleneck convolution module, the contour information and the visually more interesting features contained in the remote sensing image, and extracting, from the image features of the 5th bottleneck convolution module, the detail features contained in the remote sensing image; fusing the channel-attention- and spatial-attention-enhanced features; and mapping the multi-dimensional features to orthogonal k-dimensional features to identify and classify the remote sensing images. By taking both the main features and the detail features of the image into account, and by extracting and fusing the information of interest and the detail information, the method improves the recognition accuracy of the network and enables it to recognize scenes accurately in both complex scenes and highly similar scenes.

Description

FGR-AM method and system for remote sensing scene recognition
Technical Field
The invention relates to the technical field of computer vision, and in particular to an FGR-AM method and an FGR-AM system for remote sensing scene recognition.
Background
Remote sensing scene classification divides an image into blocks and assigns each block an appropriate category (such as residential area, farmland, river, or forest) according to its composition. This is of great significance for image management, retrieval, analysis, and the detection and recognition of typical targets. As resolution increases, images become more diverse, allowing fine-grained classification and recognition. At the same time, high-resolution remote sensing images contain richer detail, their features are more varied, and objects on the ground are usually interleaved. The similarity between images of the same class decreases, and the differences within a class increase significantly. In addition, the rotational and positional relationships between objects in an image must also be considered. These problems make high-precision scene classification challenging.
The rapid development of high-resolution remote sensing imagery brings new opportunities for remote sensing scene classification, but also a greater challenge: rich image detail contains more invalid information. For example, park grassland and golf courses are highly similar, and after deep feature extraction, excessive detail information can mislead the network's judgment. A greedy layer-wise unsupervised pre-training algorithm has been proposed that performs well in both aerial scene classification and high-resolution land-use classification. Most current high-precision scene classification methods adopt deep CNNs (such as VGG16, GoogLeNet, and ResNet50). However, because remote sensing images have few categories and relatively little labeled data, applying deep convolutional features to them directly is difficult. A multi-subset feature fusion method has therefore been proposed that fuses the deep features extracted by several convolutional neural networks and integrates their global and local information, yielding lower-dimensional features with stronger discriminative power.
In recent years, inspired by the human visual mechanism, attention mechanisms have improved the performance of many CNN-based vision tasks. The Convolutional Block Attention Module (CBAM) applies channel attention and spatial attention in turn, and SKNet fuses attention-enhanced multi-scale features to achieve an approximately adaptive selection of the receptive field. For example, the invention with publication number CN112861978A provides an attention-based multi-branch feature fusion method for remote sensing scene image classification, aiming to solve the low accuracy of existing remote sensing scene classification methods. Its process is: step one, acquire a remote sensing image and preprocess it to obtain a preprocessed remote sensing image; step two, build an attention-based multi-branch feature fusion convolutional neural network, AMB-CNN; step three, train AMB-CNN with the preprocessed remote sensing images to obtain a pre-trained attention-based AMB-CNN; and step four, classify the remote sensing images to be recognized with the trained AMB-CNN. The invention with publication number CN113052188A discloses a remote sensing image target detection method that extracts multi-scale feature maps with a ResNet residual network and fuses them through cross-channel information fusion according to the target characteristics, enhancing the semantic information and richness of the features to obtain fused multi-scale feature maps; an attention mechanism is introduced on the fused feature maps to generate probabilistic saliency maps, weakening redundant background information in the remote sensing image and enhancing target saliency; and the position information of each key point of the detection frame after the first regression is introduced to reconstruct a feature map with position information for the final multi-class classification and localization prediction. Of these two examples, the former extracts the features of the third convolution module, processes them with an attention module, and fuses them with the features of the original convolution module, classifying remote sensing scene images more accurately at lower complexity; the latter combines the target features of the remote sensing image and processes them with an attention module, handling the small target sizes, complex background information, and imprecise localization found in remote sensing images.
However, neither method is suited to high-resolution remote sensing images that contain scenes of high similarity, or that contain both highly similar scenes and widely differing scenes at the same time. In fact, most existing methods that extract remote sensing scene feature maps with a neural network either ignore detail information when attending to the main features of the image, or, after extracting detail excessively, lose recognition accuracy in highly similar scenes, so these problems remain difficult to solve.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an FGR-AM method and system for remote sensing scene recognition that take both the main features and the detail features of the image into account, so that the extracted features contain rich detail features (the rich detail extracted by the 5th channel allows scene categories to be recognized accurately in differing remote sensing scenes) without ignoring the visually more interesting information (the features extracted by the 3rd channel both enhance effective information and effectively filter out ineffective information in the attention module); by extracting and fusing the information of interest and the detail information, the recognition accuracy of the network is improved, and the network can recognize scenes accurately in both complex and similar scenes.
In order to achieve this purpose, the invention adopts the following technical scheme:
In a first aspect, an embodiment of the present invention provides an FGR-AM method for remote sensing scene recognition, where the FGR-AM method includes the following steps:
S1, performing feature extraction on the input original remote sensing image by adopting 5 bottleneck convolution modules connected in sequence;
S2, performing effective information enhancement processing and ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules;
S3, extracting the contour information and the visually more interesting features contained in the remote sensing image from the image features extracted by the 3rd bottleneck convolution module, and extracting the detail features contained in the remote sensing image from the image features extracted by the 5th bottleneck convolution module;
S4, aggregating the channel-attention- and spatial-attention-enhanced features by adopting a bilinear fine-grained feature fusion module, and fusing the extracted contour information contained in the remote sensing image, the visually more interesting features, and the detail features contained in the remote sensing image to form a bilinear vector with a globally consistent spatial and channel representation;
S5, adopting a principal component analysis module to map the multi-dimensional features generated in step S4 to orthogonal k-dimensional features, and identifying and classifying the remote sensing images.
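As an illustrative aid (not part of the claimed implementation), the data flow of steps S1 to S5 can be sketched in PyTorch as follows. The class and argument names are assumptions; the bottleneck, attention, and fusion components are the ones detailed in the remainder of this disclosure, and the PCA stage of S5 is applied to the fused vector downstream.

```python
import torch
import torch.nn as nn

class FGRAMPipeline(nn.Module):
    """Wires together the components described in steps S1-S5 (names illustrative)."""
    def __init__(self, bottlenecks, channel_attn3, spatial_attn3,
                 channel_attn5, spatial_attn5, fuse):
        super().__init__()
        self.bottlenecks = nn.ModuleList(bottlenecks)  # S1: 5 modules in sequence
        self.channel_attn3 = channel_attn3             # S2: branch after module 3
        self.spatial_attn3 = spatial_attn3             # S3: spatial attention, branch 3
        self.channel_attn5 = channel_attn5             # S2: branch after module 5
        self.spatial_attn5 = spatial_attn5             # S3: spatial attention, branch 5
        self.fuse = fuse                               # S4: bilinear fine-grained fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        taps = {}
        for i, block in enumerate(self.bottlenecks, start=1):
            x = block(x)
            if i in (3, 5):        # tap the outputs of the 3rd and 5th modules
                taps[i] = x
        # Each branch: channel attention produces M_C, spatial attention applies
        # the F1/F2 operations defined below.
        f3 = self.spatial_attn3(taps[3], self.channel_attn3(taps[3]))
        f5 = self.spatial_attn5(taps[5], self.channel_attn5(taps[5]))
        return self.fuse(f3, f5)   # S5 (the PCA head) is applied downstream
```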
Optionally, in step S1, the feature extraction process of a bottleneck convolution module includes the following steps:
S11, inputting the image into a standard convolution layer with a 1 × 1 convolution kernel and a swish activation function to extract features, the channels being expanded to n times the base channel count;
S12, inputting the features extracted in step S11 into a Depthwise convolution layer with a 3 × 3 convolution kernel and a stride of 2 for feature extraction, the channel count remaining unchanged;
S13, inputting the image features extracted in step S12 into a linear convolution with a 1 × 1 convolution kernel, reducing the feature map back to the original channel count.
Optionally, in processing order, the base channel counts of the 5 bottleneck convolution modules are 64, 128, 256, 512, and 512, respectively;
wherein the expansion factor n is 6 for the 1st and 2nd bottleneck convolution modules, 4 for the 3rd and 4th bottleneck convolution modules, and 2 for the 5th bottleneck convolution module.
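A minimal PyTorch sketch of one such bottleneck convolution module follows; it is an illustrative reconstruction of steps S11 to S13, not the patented code. nn.SiLU is PyTorch's implementation of the swish activation, and the omission of normalization layers reflects only that the text does not specify any.

```python
import torch
import torch.nn as nn

class BottleneckModule(nn.Module):
    def __init__(self, in_channels: int, base_channels: int, expansion: int):
        super().__init__()
        hidden = base_channels * expansion
        self.block = nn.Sequential(
            # S11: 1 x 1 standard convolution with swish, expanding the
            # channels to n times the base channel count
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.SiLU(),
            # S12: 3 x 3 depthwise convolution with stride 2, channels unchanged
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1,
                      groups=hidden),
            nn.SiLU(),
            # S13: 1 x 1 linear convolution (no activation), back to base channels
            nn.Conv2d(hidden, base_channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# The five modules with the base channel counts and expansion factors above;
# the 64-channel input to module 1 comes from the stem convolution described
# in the worked example later in this disclosure.
base, expand = [64, 128, 256, 512, 512], [6, 6, 4, 4, 2]
modules = [BottleneckModule(cin, cout, n)
           for cin, cout, n in zip([64] + base[:-1], base, expand)]
```

With a 224 × 224 × 64 input, the first module produces a 112 × 112 × 64 output, matching the worked example given later.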
Optionally, in step S2, the process of performing the effective information enhancement processing and the ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules includes the following steps:
S21, for the 3rd or 5th bottleneck convolution module, performing maximum pooling and average pooling separately on the features F ∈ R^(c×h×w) extracted by the corresponding bottleneck convolution module, the pooled feature dimension being 1 × 1 × c, where c represents the number of channels, h represents the height of the input feature map, and w represents the width of the input feature map;
S22, inputting the two feature descriptors of dimension 1 × 1 × c obtained by the maximum pooling and the average pooling into a shared MLP, wherein the first layer and the second layer of the MLP have c/16 and c units, respectively;
S23, performing weight addition on the two obtained feature vectors, and calculating the weight matrix of channel attention by using a sigmoid function to obtain M_C ∈ R^(c×1×1).
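This channel attention can be sketched as follows (a minimal PyTorch illustration, assuming a CBAM-style shared MLP with reduction ratio 16; the ReLU between the two MLP layers is an assumption, since steps S21 to S23 do not name a hidden activation):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # S22: shared MLP, first layer c/16 units, second layer c units
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),  # assumption: hidden activation unspecified in the text
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # S21: global maximum and average pooling -> two 1 x 1 x c descriptors
        max_desc = torch.amax(f, dim=(2, 3))
        avg_desc = torch.mean(f, dim=(2, 3))
        # S23: add the two vectors, apply sigmoid -> M_C in R^(c x 1 x 1)
        m_c = torch.sigmoid(self.mlp(max_desc) + self.mlp(avg_desc))
        return m_c.unsqueeze(-1).unsqueeze(-1)
```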
Optionally, in step S3, the process of simultaneously extracting the contour information and the visually more interesting features contained in the remote sensing image from the image features extracted by the 3rd bottleneck convolution module, and extracting the detail features contained in the remote sensing image from the image features extracted by the 5th bottleneck convolution module, includes the following steps:
S31, performing the F1 operation on the extracted weight matrix M_C and the features F to obtain a new weight matrix F′;
S32, performing a 7 × 7 convolution on the weight matrix F′, and calculating the weight matrix of spatial attention by using a sigmoid function to obtain M_S ∈ R^(1×h×w); wherein the F1 operation is F′ = M_C ⊗ F, and ⊗ represents element-level multiplication;
S33, performing the F2 operation on M_S and F′ to obtain a new weight matrix F″; wherein the F2 operation is F″ = M_S ⊗ F′.
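Continuing the same illustrative PyTorch sketch, the spatial attention of steps S31 to S33 can be written as below. Collapsing the c input channels to a single-channel map inside the 7 × 7 convolution is an assumption, as the text fixes only the kernel size; for the branch from the 3rd bottleneck convolution module the channel count would be 256, and for the 5th it would be 512.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 7 x 7 convolution producing a one-channel spatial map (assumption:
        # the text specifies the kernel size but not the channel mapping)
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor, m_c: torch.Tensor) -> torch.Tensor:
        f_prime = m_c * f                        # S31: F' = M_C (x) F, broadcast multiply
        m_s = torch.sigmoid(self.conv(f_prime))  # S32: M_S in R^(1 x h x w)
        return m_s * f_prime                     # S33: F'' = M_S (x) F'
```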
Optionally, in step S5, the process of mapping the multi-dimensional features to orthogonal k-dimensional features by using the principal component analysis module, and identifying and classifying the remote sensing image, includes the following steps:
S51, in the network training stage, training with a fully connected layer;
S52, in the image retrieval and recognition stage, replacing the fully connected layer with the principal component analysis module and mapping the multi-dimensional features to the orthogonal k-dimensional features.
In a second aspect, an embodiment of the present invention provides an FGR-AM system for remote sensing scene recognition, where the FGR-AM system includes:
the FGR-AM remote sensing scene network comprises 5 bottleneck convolution modules, a first channel attention module, a first spatial attention module, a second channel attention module, a second spatial attention module, a bilinear feature fusion module and a principal component analysis module;
and the FGR-AM remote sensing scene network training module is used for replacing the principal component analysis module with a fully connected layer to train the FGR-AM remote sensing scene network.
The 5 bottleneck convolution modules are sequentially connected and used for carrying out feature extraction on the input original remote sensing image;
the input end of the first channel attention module is connected with the output end of the 3 rd bottleneck convolution module, and the output end of the first channel attention module is connected to the bilinear feature fusion module through the first spatial attention module; the input end of the second channel attention module is connected with the output end of the 5 th bottleneck convolution module, and the output end of the second channel attention module is connected to the bilinear feature fusion module through the second spatial attention module;
the first channel attention module and the second channel attention module are respectively used for performing the effective information enhancement processing and the ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules; the first spatial attention module is used for simultaneously extracting, from the image features extracted by the 3rd bottleneck convolution module, the contour information contained in the remote sensing image and the visually more interesting features; the second spatial attention module is used for extracting, from the image features extracted by the 5th bottleneck convolution module, the detail features contained in the remote sensing image;
the bilinear feature fusion module is used for aggregating the channel-attention- and spatial-attention-enhanced features, and fusing the extracted contour information contained in the remote sensing image, the visually more interesting features, and the detail features contained in the remote sensing image to form a bilinear vector with a globally consistent spatial and channel representation;
and the principal component analysis module is used for mapping the multidimensional characteristics generated by the bilinear characteristic fusion module to orthogonal k-dimensional characteristics and identifying and classifying the remote sensing image.
Optionally, the bottleneck convolution module includes a standard convolution layer, a Depthwise convolution layer, and a linear convolution layer connected in sequence;
the standard convolution layer has a 1 × 1 convolution kernel and a swish activation function, and expands the channels to n times the base channel count; the Depthwise convolution layer has a 3 × 3 convolution kernel and a stride of 2, maintaining the channel count at n times the base; the linear convolution layer has a 1 × 1 convolution kernel and reduces the channel count back to the base;
in image processing order, the base channel counts of the 5 bottleneck convolution modules are 64, 128, 256, 512, and 512, respectively; wherein the expansion factor n is 6 for the 1st and 2nd bottleneck convolution modules, 4 for the 3rd and 4th bottleneck convolution modules, and 2 for the 5th bottleneck convolution module.
Optionally, the process of performing the effective information enhancement processing and the ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules by the first channel attention module and the second channel attention module includes the following steps:
for the 3rd or 5th bottleneck convolution module, performing maximum pooling and average pooling separately on the features F ∈ R^(c×h×w) extracted by the corresponding bottleneck convolution module, the pooled feature dimension being 1 × 1 × c, where c represents the number of channels, h represents the height of the input feature map, and w represents the width of the input feature map;
inputting the two feature descriptors of dimension 1 × 1 × c obtained by the maximum pooling and the average pooling into a shared MLP, wherein the first layer and the second layer of the MLP have c/16 and c units, respectively;
performing weight addition on the two obtained feature vectors, and calculating the weight matrix of channel attention by using a sigmoid function to obtain M_C ∈ R^(c×1×1).
Optionally, the process of extracting features by the first spatial attention module or the second spatial attention module includes the following steps:
S31, performing the F1 operation on the weight matrix M_C extracted by the first channel attention module or the second channel attention module and the features F to obtain a new weight matrix F′;
S32, performing a 7 × 7 convolution on the weight matrix F′, and calculating the weight matrix of spatial attention by using a sigmoid function to obtain M_S ∈ R^(1×h×w); wherein the F1 operation is F′ = M_C ⊗ F, and ⊗ represents element-level multiplication;
S33, performing the F2 operation on M_S and the weight matrix F′ to obtain a new weight matrix F″; wherein the F2 operation is F″ = M_S ⊗ F′.
The invention has the beneficial effects that:
the invention gives consideration to the main characteristics and the detail characteristics of the image, so that the extracted characteristics not only contain rich detail characteristics (the rich detail characteristics extracted by the 5 th channel can accurately identify scene categories in the remote sensing scene with difference), but also do not ignore more interesting information in vision (the characteristics extracted by the 3 rd channel not only enhance effective information but also effectively filter ineffective information in an attention module); the interesting information and the detail information are extracted and fused, so that the identification precision of the network is improved, and the network can accurately identify scenes in complex scenes and similar scenes.
The invention adopts 5 bottleneck convolution modules, extracts image characteristics, solves the problem of information loss caused by using an activation function in a network, and reduces parameters in the network. On the basis, a channel attention module is adopted for the No. 3 and No. 5 bottleneck convolution modules, so that effective information in a channel is enhanced, and ineffective information is restrained. Also, mid-level features represent generally better effects on scaling, rotation, and illumination changes in the image than low-level features.
The method adopts the space attention module to extract the characteristics of a more interesting part in the image vision, so as to improve the identification precision of the network on similar scenes. The invention adopts a CBP linear fusion module, outputs the outer product at the same space position, calculates bilinear characteristics, captures the pairwise correlation between characteristic channels, and provides stronger representation than a linear model through linear combination. The invention adopts PCA to map n-dimensional features from CBP to orthogonal k-dimensional features, so that the network has better generalization capability.
Drawings
FIG. 1 is a flow chart of an FGR-AM method for remote sensing scene recognition according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a channel attention module and a spatial attention module according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an FGR-AM remote sensing scene network according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a comparison result between the FGR-AM method of the embodiment of the present invention and several currently used remote sensing scene recognition methods.
Fig. 5 is a schematic diagram of the recognition accuracy of the FGR-AM method of the embodiment of the present invention and several commonly used remote sensing scene recognition methods on the NWPU-RESISC45 data set.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front", and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the present invention; a change in their relative relationships, without substantial alteration of the technical content, shall also be regarded as within the scope of the invention.
Example one
FIG. 1 is a flow chart of an FGR-AM method for remote sensing scene recognition according to an embodiment of the present invention. The embodiment is applicable to the case of performing identification detection on the remote sensing scene image through a device such as a server, and the method can be executed by an FGR-AM system for remote sensing scene identification, and the system can be implemented in a software and/or hardware manner, and can be integrated in an electronic device, for example, an integrated server device.
Referring to fig. 1, the FGR-AM method includes the steps of:
S1, performing feature extraction on the input original remote sensing image by adopting 5 bottleneck convolution modules connected in sequence.
S2, performing effective information enhancement processing and ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules.
S3, extracting the contour information and the visually more interesting features contained in the remote sensing image from the image features extracted by the 3rd bottleneck convolution module, and extracting the detail features contained in the remote sensing image from the image features extracted by the 5th bottleneck convolution module, respectively.
S4, aggregating the channel-attention- and spatial-attention-enhanced features by adopting a bilinear fine-grained feature fusion module, and fusing the extracted contour information contained in the remote sensing image, the visually more interesting features, and the detail features contained in the remote sensing image to form a bilinear vector with a globally consistent spatial and channel representation.
S5, adopting a principal component analysis module to map the multi-dimensional features generated in step S4 to orthogonal k-dimensional features, and identifying and classifying the remote sensing images.
In this embodiment, an FGR-AM remote sensing scene network is first constructed; fig. 3 is a schematic structural diagram of the remote sensing scene network according to the embodiment of the present invention. The FGR-AM remote sensing scene network comprises 5 bottleneck convolution modules, a channel attention module, a spatial attention module, a bilinear fine-grained fusion module, and a principal component analysis module.
The 5 bottleneck convolution modules extract the features of the remote sensing image while avoiding the information loss caused by activation functions in the network and reducing the number of network parameters. The channel attention module performs information enhancement on the image features extracted by the 3rd and 5th bottleneck convolution modules, strengthening the effective information in each channel and suppressing the ineffective information; moreover, mid-level features are generally more robust than low-level features to scaling, rotation, and illumination changes in the image. The spatial attention module extracts the interesting features in the remote sensing image, the visually more interesting parts, improving the network's recognition accuracy on similar scenes. The bilinear fine-grained fusion module adopts the CBP fusion module to compute the outer product at each spatial position and obtain bilinear features; the outer product captures the pairwise correlations between feature channels, so this combination provides a stronger representation than a linear model. The principal component analysis module replaces the FC layer with PCA when retrieving and recognizing images, mapping the multi-dimensional features to orthogonal k-dimensional features and improving the network's generalization ability.
The specific steps of the method in this embodiment are described in detail through an example.
Step one, inputting an original remote sensing image into the FGR-AM remote sensing scene network and extracting image features with the 5 bottleneck convolution modules, comprising the following substeps:
Step 1-1, a remote sensing image from the NWPU-RESISC45 data set is input into the network with its size set to 224 × 224 pixels; a standard convolution with a 3 × 3 kernel and 64 channels is applied, and the output feature map size is 224 × 224 × 64.
Step 1-2, in the first layer of bottleneck convolution module 1, the convolution kernel is 1 × 1, the activation function is swish, and the channel count is expanded to 6 times the base channel count; the output feature map size is 224 × 224 × (6 × 64).
Step 1-3, the second layer of bottleneck convolution module 1 adopts Depthwise convolution with a 3 × 3 kernel, a swish activation function, and a stride of 2; the output feature map size is 112 × 112 × (6 × 64).
Step 1-4, the third layer of bottleneck convolution module 1 reduces the feature map dimension with a 1 × 1 standard convolution; the output feature map size is 112 × 112 × 64.
Step 1-5, the other 4 bottleneck convolution modules are similar to module 1: the 2nd module expands its channels to 6 times the base count, the 3rd and 4th modules to 4 times, and the 5th module to 2 times; the base channel counts of the 5 modules are 64, 128, 256, 512, and 512, respectively.
In step one, multi-layer feature extraction is performed on the image; because no activation function is adopted in the third layer of each bottleneck convolution module, the information loss caused by activation functions in the network is avoided, and the number of network parameters is reduced.
Step two, inputting the image features extracted by the 3rd and 5th bottleneck convolution modules into the channel attention module to enhance the features extracted by the bottleneck convolution modules of different layers, as shown in fig. 2, comprising the following substeps:
step 2-1, the feature size extracted by the 3 rd bottle neck convolution module is 56 × 56 × 256, the feature size extracted by the 5 th convolution module is 14 × 14 × 512, and the two extracted feature sizes are input into the channel attention module.
Step 2-2, taking the feature size extracted by the 3 rd bottleneck convolution module as an example, performing maximum pooling and average pooling on the features 56 × 56 × 256 extracted by the 3 rd bottleneck convolution module respectively to obtain two feature vectors of 1 × 1 × 256.
Step 2-3, two feature sizes of 1 × 1 × 256 are input into the shared MLP, the first and second layers of the MLP being 16 and 256, respectively.
And 2-4, performing weight addition on the two obtained characteristic sizes of 1 multiplied by 256, and calculating a weight matrix of channel attention by using a sigmoid function to obtain a weight matrix of 1 multiplied by 256.
And 2-5, repeating the steps on the characteristics extracted by the 5 th bottleneck convolution module.
In step two, a channel attention module is adopted for the 3rd and 5th bottleneck convolution modules, so that the effective information in each channel is enhanced and the ineffective information is suppressed. Moreover, mid-level features are generally more robust than low-level features to scaling, rotation, and illumination changes in the image. Specifically, compared with the features extracted by the 3rd bottleneck convolution module, the feature information extracted by the 1st and 2nd bottleneck convolution modules contains too much invalid information; if it were input into the attention module, the invalid information would be enhanced as well, which does not help improve recognition accuracy. At module 3, the features extracted by the network retain both the contour information in the image and the more interesting information in the image. The 5th convolution module performs deep feature extraction, so the detail features are finally fully extracted. The features extracted by these two convolution modules are input separately, and after output from the attention module, the feature information of the 3rd channel is strengthened in the more interesting parts. For example, the park grassland and golf course scenes in remote sensing imagery are highly similar; through the contours and the visually more interesting features extracted by the 3rd channel, that is, through the differences in buildings and other parts of the two scenes, the two scenes are correctly recognized and distinguished. After the detail features extracted by the 5th channel are enhanced by the attention module, scenes with large differences are recognized with high accuracy, the scene category being judged from the detail information. Fusing the features extracted by the two modules gives the network better recognition accuracy when recognizing remote sensing scenes, whether the scenes are highly similar or widely different. In addition, besides enhancing features, the attention module can suppress invalid features through the channel attention component of the attention mechanism, highlighting key feature positions and yielding a globally consistent spatial/channel representation.
Step three, extracting the visually interesting features in the remote sensing image by adopting the spatial attention module, as shown in fig. 2, comprising the following substeps:
Step 3-1, taking the features extracted by the 3rd bottleneck convolution module as an example: after the channel attention module, the obtained 1 × 1 × 256 weight matrix M_C and the 56 × 56 × 256 features F are subjected to the F1 operation to obtain a new weight matrix F′, where the F1 operation is F′ = M_C ⊗ F and ⊗ represents element-level multiplication.
Step 3-2, a 7 × 7 convolution is performed on the weight matrix F′, and the weight matrix of spatial attention is calculated by using a sigmoid function to obtain M_S ∈ R^(1×h×w).
Step 3-3, the F2 operation is performed on M_S and F′ to obtain a new weight matrix F″, where the F2 operation is F″ = M_S ⊗ F′.
In step three, the spatial attention module is adopted to extract the features of the more interesting parts of the image, so as to improve the network's recognition accuracy on similar scenes.
Step four, fusing the features extracted from different layers by adopting the bilinear fine-grained feature fusion module: the final weight matrices obtained by the two channels after the attention modules are input into the CBP module, and the CBP module aggregates the channel-attention- and spatial-attention-enhanced features to form a bilinear vector with a globally consistent spatial and channel representation.
In step four, the CBP fusion module computes the outer product at each spatial position to obtain the bilinear features; the outer product captures the pairwise correlations between feature channels, so this combination provides a stronger representation than a linear model.
Step five, in the image retrieval and recognition stage, the principal component analysis module is adopted to map the multi-dimensional features to orthogonal k-dimensional features, improving the network's generalization ability, comprising the following substeps:
Step 5-1, an FC layer is adopted when training the FGR-AM network; the dimension of the last FC layer equals the number of classes in the image data set, which is 45.
Step 5-2, when retrieving and recognizing images, the FC layer is replaced with PCA.
In step five, the FC layer is replaced with PCA when retrieving and recognizing images because the FC layer is strongly influenced by the original training data set and cannot generalize to a new target data set. PCA maps the multi-dimensional features to orthogonal k-dimensional features, improving the network's generalization ability. That is, in this embodiment, the FC layer is used in the training stage, where it captures high-dimensional information; to keep the dimensionality-reduced output consistent, principal component analysis is adopted in the retrieval and recognition stage, effectively improving the network's generalization ability. In this embodiment the training and recognition modules together form the principal component analysis scheme, rather than using principal component analysis merely as a standalone dimension-reduction step inside a remote sensing network as in the prior art; for example, the parcel crop recognition method combining remote sensing image time series and texture features disclosed in the invention with publication number CN112395914A is a typical application of standalone PCA dimension reduction.
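A minimal sketch of the PCA replacement at retrieval and recognition time, assuming scikit-learn; the component count k = 128 is an illustrative assumption, the disclosure fixing only that the k output dimensions are orthogonal.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_head(train_features: np.ndarray, k: int = 128) -> PCA:
    """Fit PCA on fused bilinear vectors of shape (num_images, n)."""
    pca = PCA(n_components=k)   # principal components are mutually orthogonal
    pca.fit(train_features)
    return pca

def embed(pca: PCA, features: np.ndarray) -> np.ndarray:
    # Maps the n-dimensional fused features to orthogonal k-dimensional features
    return pca.transform(features)
```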
Fig. 4 and fig. 5 show the comparison results of the FGR-AM method of this embodiment against several commonly used remote sensing scene recognition methods. It can be seen that the FGR-AM scene network of this embodiment achieves higher recognition accuracy.
Example two
The embodiment of the invention provides an FGR-AM system for remote sensing scene recognition, which comprises an FGR-AM remote sensing scene network and an FGR-AM remote sensing scene network training module.
The FGR-AM remote sensing scene network comprises 5 bottleneck convolution modules, a first channel attention module, a first spatial attention module, a second channel attention module, a second spatial attention module, a bilinear feature fusion module, and a principal component analysis module. The FGR-AM remote sensing scene network training module is used for replacing the principal component analysis module with a fully connected layer to train the FGR-AM remote sensing scene network.
The 5 bottleneck convolution modules are connected in sequence and perform feature extraction on the input original remote sensing image; the input of the first channel attention module is connected to the output of the 3rd bottleneck convolution module, and its output is connected to the bilinear feature fusion module through the first spatial attention module; the input of the second channel attention module is connected to the output of the 5th bottleneck convolution module, and its output is connected to the bilinear feature fusion module through the second spatial attention module; the first and second channel attention modules respectively perform the effective information enhancement processing and the ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules; the first spatial attention module simultaneously extracts, from the image features of the 3rd bottleneck convolution module, the contour information contained in the remote sensing image and the visually more interesting features; the second spatial attention module extracts, from the image features of the 5th bottleneck convolution module, the detail features contained in the remote sensing image; the bilinear feature fusion module aggregates the channel-attention- and spatial-attention-enhanced features and fuses the extracted contour information, the visually more interesting features, and the detail features to form a bilinear vector with a globally consistent spatial and channel representation; and the principal component analysis module maps the multi-dimensional features generated by the bilinear feature fusion module to orthogonal k-dimensional features and identifies and classifies the remote sensing image.
Optionally, the bottleneck convolution module includes a standard convolution layer, a Depthwise convolution layer, and a linear convolution layer connected in sequence; the standard convolution layer has a 1 × 1 convolution kernel and a swish activation function, and expands the channels to n times the base channel count; the Depthwise convolution layer has a 3 × 3 convolution kernel and a stride of 2, maintaining the channel count at n times the base; the linear convolution layer has a 1 × 1 convolution kernel and reduces the channel count back to the base. In image processing order, the base channel counts of the 5 bottleneck convolution modules are 64, 128, 256, 512, and 512, respectively; wherein the expansion factor n is 6 for the 1st and 2nd bottleneck convolution modules, 4 for the 3rd and 4th bottleneck convolution modules, and 2 for the 5th bottleneck convolution module.
Optionally, the process of performing the effective information enhancement processing and the ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules by the first channel attention module and the second channel attention module includes the following steps:
for the 3rd or 5th bottleneck convolution module, performing maximum pooling and average pooling separately on the features F ∈ R^(c×h×w) extracted by the corresponding bottleneck convolution module, the pooled feature dimension being 1 × 1 × c, where c represents the number of channels, h represents the height of the input feature map, and w represents the width of the input feature map; inputting the two feature descriptors of dimension 1 × 1 × c obtained by the maximum pooling and the average pooling into a shared MLP, wherein the first layer and the second layer of the MLP have c/16 and c units, respectively; performing weight addition on the two obtained feature vectors, and calculating the weight matrix of channel attention by using a sigmoid function to obtain M_C ∈ R^(c×1×1).
Optionally, the process of extracting features by the first spatial attention module or the second spatial attention module comprises the following steps:
S31, performing the F1 operation on the weight matrix M_C extracted by the first channel attention module or the second channel attention module and the features F to obtain a new weight matrix F′;
S32, performing a 7 × 7 convolution on the weight matrix F′, and calculating the weight matrix of spatial attention by using a sigmoid function to obtain M_S ∈ R^(1×h×w); wherein the F1 operation is F′ = M_C ⊗ F, and ⊗ represents element-level multiplication;
S33, performing the F2 operation on M_S and the weight matrix F′ to obtain a new weight matrix F″; wherein the F2 operation is F″ = M_S ⊗ F′.
Through the FGR-AM system of the second embodiment of the invention, remote sensing scene images are recognized and detected. The FGR-AM system provided by the embodiment of the invention can execute the FGR-AM method for remote sensing scene recognition provided by any embodiment of the invention, and has the corresponding functional modules and the beneficial effects of the executed method.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions under the inventive concept belong to the protection scope of the present invention. It should be noted that modifications and refinements that those skilled in the art may make without departing from the principle of the present invention shall also be regarded as within the protection scope of the present invention.

Claims (9)

1. An FGR-AM method for remote sensing scene recognition, characterized in that the FGR-AM method comprises the following steps:
S1, performing feature extraction on the input original remote sensing image by adopting 5 bottleneck convolution modules connected in sequence;
S2, connecting a channel attention module and a spatial attention module, respectively, after the 3rd bottleneck convolution module and the 5th bottleneck convolution module to form two channels;
the channel attention module is used for enhancing the effective information in the corresponding channel and suppressing the ineffective information;
the spatial attention module is used for simultaneously extracting the contour information and the visually more interesting features contained in the remote sensing image from the image features extracted by the 3rd bottleneck convolution module, and extracting the detail features contained in the remote sensing image from the image features extracted by the 5th bottleneck convolution module;
S3, inputting the final weight matrices obtained by the two channels through the spatial attention modules into a bilinear fine-grained feature fusion module, namely fusing the extracted contour information contained in the remote sensing image, the visually more interesting features, and the detail features contained in the remote sensing image to form a bilinear vector with a globally consistent spatial and channel representation;
S4, adopting a principal component analysis module to map the multi-dimensional features generated in step S3 to orthogonal k-dimensional features, and identifying and classifying the remote sensing images.
2. The FGR-AM method for remote sensing scene recognition of claim 1, wherein in step S1, the feature extraction process of the bottleneck convolution module comprises the following steps:
S11, inputting the image into a standard convolution layer with a 1 × 1 convolution kernel and a swish activation function to extract features, the channels being expanded to n times the base channel count;
S12, inputting the features extracted in step S11 into a Depthwise convolution layer with a 3 × 3 convolution kernel and a stride of 2 for feature extraction, the channel count remaining unchanged;
S13, inputting the image features extracted in step S12 into a linear convolution with a 1 × 1 convolution kernel, reducing the feature map back to the original channel count.
3. The FGR-AM method for remote sensing scene recognition of claim 2, wherein, in processing order, the base channel counts of the 5 bottleneck convolution modules are 64, 128, 256, 512, and 512;
wherein the expansion factor n is 6 for the 1st and 2nd bottleneck convolution modules, 4 for the 3rd and 4th bottleneck convolution modules, and 2 for the 5th bottleneck convolution module.
4. The FGR-AM method for remote sensing scene recognition according to claim 1, wherein in step S2, the process of the channel attention module enhancing the effective information and suppressing the ineffective information in the corresponding channel comprises the following steps:
S21, for the 3rd or 5th bottleneck convolution module, performing maximum pooling and average pooling separately on the features F extracted by the corresponding bottleneck convolution module, the pooled feature dimension being 1 × 1 × c, where F ∈ R^(c×h×w), c represents the number of channels, h represents the height of the input feature map, and w represents the width of the input feature map;
S22, inputting the two feature descriptors of dimension 1 × 1 × c obtained by the maximum pooling and the average pooling into a shared MLP, wherein the first layer and the second layer of the MLP have c/16 and c units, respectively;
S23, performing weight addition on the two obtained feature vectors, and calculating the weight matrix of channel attention by using a sigmoid function to obtain M_C ∈ R^(c×1×1).
5. The FGR-AM method for remote sensing scene recognition according to claim 4, wherein in step S2, the process of the spatial attention module simultaneously extracting the contour information and the visually more interesting features from the image features extracted by the 3rd bottleneck convolution module, and extracting the detail features contained in the remote sensing image from the image features extracted by the 5th bottleneck convolution module, comprises the following steps:
S31, performing the F1 operation on the extracted weight matrix M_C and the features F to obtain a new weight matrix F′;
S32, performing a 7 × 7 convolution on the weight matrix F′, and calculating the weight matrix of spatial attention by using a sigmoid function to obtain M_S ∈ R^(1×h×w); wherein the F1 operation is F′ = M_C ⊗ F, and ⊗ represents element-level multiplication;
S33, performing the F2 operation on M_S and the weight matrix F′ to obtain a new weight matrix F″; wherein the F2 operation is F″ = M_S ⊗ F′.
6. An FGR-AM system for remote sensing scene recognition based on the method of claim 1, wherein the FGR-AM system comprises:
the FGR-AM remote sensing scene network comprises 5 bottleneck convolution modules, a first channel attention module, a first spatial attention module, a second channel attention module, a second spatial attention module, a bilinear feature fusion module and a principal component analysis module;
the FGR-AM remote sensing scene network training module is used for replacing the principal component analysis module with a fully connected layer to train the FGR-AM remote sensing scene network;
the 5 bottleneck convolution modules are sequentially connected and used for carrying out feature extraction on the input original remote sensing image;
the input end of the first channel attention module is connected with the output end of the 3 rd bottleneck convolution module, and the output end of the first channel attention module is connected to the bilinear feature fusion module through the first spatial attention module; the input end of the second channel attention module is connected with the output end of the 5 th bottleneck convolution module, and the output end of the second channel attention module is connected to the bilinear feature fusion module through the second spatial attention module;
the first channel attention module and the second channel attention module are respectively used for performing the effective information enhancement processing and the ineffective information suppression processing on the image features extracted by the 3rd and 5th bottleneck convolution modules; the first spatial attention module is used for simultaneously extracting the contour information contained in the remote sensing image and the visually more interesting features from the image features extracted by the 3rd bottleneck convolution module; the second spatial attention module is used for extracting the detail features contained in the remote sensing image from the image features extracted by the 5th bottleneck convolution module;
the bilinear feature fusion module is used for aggregating the channel-attention- and spatial-attention-enhanced features, and fusing the extracted contour information contained in the remote sensing image, the visually more interesting features, and the detail features contained in the remote sensing image to form a bilinear vector with a globally consistent spatial and channel representation;
and the principal component analysis module is used for mapping the multidimensional characteristics generated by the bilinear characteristic fusion module to orthogonal k-dimensional characteristics and identifying and classifying the remote sensing image.
7. The FGR-AM system for remote sensing scene recognition of claim 6, wherein the bottleneck convolution module comprises a standard convolution layer, a Depthwise convolution layer and a linear convolution layer connected in sequence;
the convolution kernel of the standard convolution layer is 1 × 1, the activation function is swish, and the number of channels is expanded to n times the number of base channels; the convolution kernel of the Depthwise convolution layer is 3 × 3 with a stride of 2, the number of channels remaining n times the number of base channels; the convolution kernel of the linear convolution layer is 1 × 1, and the number of channels is reduced back to the number of base channels;
the numbers of base channels of the 5 bottleneck convolution modules are 64, 128, 256, 512 and 512, respectively, in image processing order; the expansion factor n is 6 for the 1st and 2nd bottleneck convolution modules, 4 for the 3rd and 4th bottleneck convolution modules, and 2 for the 5th bottleneck convolution module.
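As a reading aid, here is a minimal PyTorch sketch of one such bottleneck module under the parameters of this claim. The batch normalisation layers, the padding of 1, the activation after the depthwise layer, and the 3-channel RGB input are assumptions the claim does not fix.

```python
import torch.nn as nn

class BottleneckConv(nn.Module):
    """1x1 expansion (swish) -> 3x3 depthwise, stride 2 -> 1x1 linear projection."""

    def __init__(self, in_ch: int, base_ch: int, n: int):
        super().__init__()
        hidden = base_ch * n  # channel expansion to n times the base channels
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),  # SiLU is the swish activation
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1,
                      groups=hidden, bias=False),  # depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, base_ch, 1, bias=False),  # linear: no activation
            nn.BatchNorm2d(base_ch),
        )

    def forward(self, x):
        return self.block(x)

# Base channels 64/128/256/512/512 and expansion factors 6/6/4/4/2 per the claim.
cfg = [(3, 64, 6), (64, 128, 6), (128, 256, 4), (256, 512, 4), (512, 512, 2)]
backbone = nn.Sequential(*[BottleneckConv(i, o, n) for i, o, n in cfg])
```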
8. The FGR-AM system for remote sensing scene recognition of claim 6, wherein the effective information enhancement and ineffective information suppression performed by the first and second channel attention modules on the image features extracted by the 3rd and 5th bottleneck convolution modules comprise the following steps:
for the 3rd or 5th bottleneck convolution module, performing maximum pooling and average pooling, respectively, on the features F extracted by the corresponding bottleneck convolution module, the feature dimension after pooling being 1 × 1 × c, wherein F ∈ R^(c×h×w), c represents the number of channels, h represents the height of the input feature map, and w represents the width of the input feature map;
inputting the two pooled feature vectors of dimension 1 × 1 × c obtained by the maximum pooling and the average pooling into a shared MLP, wherein the first and second layers of the MLP have c/16 and c neurons, respectively;
adding the two resulting feature vectors and calculating the channel attention weight matrix with a sigmoid function, obtaining M_C ∈ R^(c×1×1).
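A minimal PyTorch sketch of this channel attention computation might look as follows; the ReLU between the two shared-MLP layers is an assumption (the claim only fixes the layer widths c/16 and c).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global max/avg pooling -> shared two-layer MLP -> sum -> sigmoid."""

    def __init__(self, c: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c, c // reduction),  # first layer: c/16 units
            nn.ReLU(),                     # assumption: activation between layers
            nn.Linear(c // reduction, c),  # second layer: c units
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        max_vec = torch.amax(f, dim=(2, 3))  # max pooling to 1 x 1 x c
        avg_vec = f.mean(dim=(2, 3))         # average pooling to 1 x 1 x c
        m_c = torch.sigmoid(self.mlp(max_vec) + self.mlp(avg_vec))
        return m_c.view(b, c, 1, 1)          # M_C in R^(c x 1 x 1)
```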
9. The FGR-AM system for remote sensing scene recognition of claim 8, wherein the feature extraction process of the first or second spatial attention module comprises the following steps:
S31, subjecting the weight matrix M_C extracted by the first or second channel attention module to the F1 operation to obtain a new weight matrix F′;
S32, convolving the weight matrix F′ with a 7 × 7 convolution and calculating the spatial attention weight matrix with a sigmoid function, obtaining M_S ∈ R^(1×h×w); wherein F1 is calculated as
F′ = M_C ⊗ F,
wherein ⊗ represents element-level multiplication;
S33, performing the F2 operation on M_S and the weight matrix F′ to obtain a new weight matrix F″; wherein F2 is calculated as
F″ = M_S ⊗ F′.
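The steps S31–S33, together with the 7 × 7 convolution, could be sketched in PyTorch as below. Mapping the c channels of F′ to the single-channel map M_S inside that convolution, and the padding of 3, are assumptions made to match the stated shape M_S ∈ R^(1×h×w).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """F' = M_C (x) F, then a 7x7 convolution + sigmoid gives M_S,
    and F'' = M_S (x) F', with (x) denoting element-level multiplication."""

    def __init__(self, c: int):
        super().__init__()
        # Assumption: the 7x7 convolution reduces c channels to 1; padding 3
        # preserves the h x w grid so that M_S lies in R^(1 x h x w).
        self.conv = nn.Conv2d(c, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor, m_c: torch.Tensor) -> torch.Tensor:
        f1 = m_c * f                        # S31: F' = M_C (x) F (broadcast)
        m_s = torch.sigmoid(self.conv(f1))  # S32: M_S in R^(1 x h x w)
        return m_s * f1                     # S33: F'' = M_S (x) F'
```

Chained with the channel attention of claim 8, one branch would compute, e.g., `out = SpatialAttention(c)(f, ChannelAttention(c)(f))` before the bilinear fusion.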
CN202110894846.4A 2021-08-05 2021-08-05 FGR-AM method and system for remote sensing scene recognition Active CN113343953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110894846.4A CN113343953B (en) 2021-08-05 2021-08-05 FGR-AM method and system for remote sensing scene recognition

Publications (2)

Publication Number Publication Date
CN113343953A CN113343953A (en) 2021-09-03
CN113343953B (en) 2021-12-21

Family

ID=77480806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110894846.4A Active CN113343953B (en) 2021-08-05 2021-08-05 FGR-AM method and system for remote sensing scene recognition

Country Status (1)

Country Link
CN (1) CN113343953B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995782B (en) * 2022-08-03 2022-10-25 上海登临科技有限公司 Data processing method, device, equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688784A (en) * 2017-08-23 2018-02-13 福建六壬网安股份有限公司 Character recognition method and storage medium based on fusion of deep and shallow features
CN111462126A (en) * 2020-04-08 2020-07-28 武汉大学 Semantic image segmentation method and system based on edge enhancement

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104240256B (en) * 2014-09-25 2017-03-15 西安电子科技大学 A kind of image significance detection method based on the sparse modeling of stratification
WO2019222759A1 (en) * 2018-05-18 2019-11-21 Synaptics Incorporated Recurrent multimodal attention system based on expert gated networks
CN109871798B (en) * 2019-02-01 2021-06-29 浙江大学 Remote sensing image building extraction method based on convolutional neural network
CN110473267A (en) * 2019-07-12 2019-11-19 北京邮电大学 Social networks image based on attention feature extraction network describes generation method
CN111639594B (en) * 2020-05-29 2023-09-22 苏州遐迩信息技术有限公司 Training method and device for image description model
CN111680667B (en) * 2020-07-13 2022-06-24 北京理工大学重庆创新中心 Remote sensing image ground object classification method based on deep neural network
CN112203098B (en) * 2020-09-22 2021-06-01 广东启迪图卫科技股份有限公司 Mobile terminal image compression method based on edge feature fusion and super-resolution
CN112070070B (en) * 2020-11-10 2021-02-09 南京信息工程大学 LW-CNN method and system for urban remote sensing scene recognition
CN112464787B (en) * 2020-11-25 2022-07-08 北京航空航天大学 Remote sensing image ship target fine-grained classification method based on spatial fusion attention
CN112784779A (en) * 2021-01-28 2021-05-11 武汉大学 Remote sensing image scene classification method based on feature pyramid multilevel feature fusion
CN112861978B (en) * 2021-02-20 2022-09-02 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism
CN113033630A (en) * 2021-03-09 2021-06-25 太原科技大学 Infrared and visible light image deep learning fusion method based on double non-local attention models
CN113052188A (en) * 2021-03-26 2021-06-29 大连理工大学人工智能大连研究院 Method, system, equipment and storage medium for detecting remote sensing image target
CN113192040B (en) * 2021-05-10 2023-09-22 浙江理工大学 Fabric flaw detection method based on YOLO v4 improved algorithm

Also Published As

Publication number Publication date
CN113343953A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN110728263B (en) Pedestrian re-recognition method based on strong discrimination feature learning of distance selection
Peng et al. Detecting heads using feature refine net and cascaded multi-scale architecture
CN112101150B (en) Multi-feature fusion pedestrian re-identification method based on orientation constraint
Gao et al. Change detection from synthetic aperture radar images based on channel weighting-based deep cascade network
Tian et al. A dual neural network for object detection in UAV images
CN103336957B Network homologous video detection method based on spatio-temporal features
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN102385592B (en) Image concept detection method and device
CN111612008A (en) Image segmentation method based on convolution network
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN112580480B (en) Hyperspectral remote sensing image classification method and device
CN109710804B (en) Teaching video image knowledge point dimension reduction analysis method
CN116863539A (en) Fall figure target detection method based on optimized YOLOv8s network structure
CN111967464A (en) Weak supervision target positioning method based on deep learning
CN113743484A (en) Image classification method and system based on space and channel attention mechanism
CN111639697B (en) Hyperspectral image classification method based on non-repeated sampling and prototype network
CN114663707A (en) Improved few-sample target detection method based on fast RCNN
CN117037004A (en) Unmanned aerial vehicle image detection method based on multi-scale feature fusion and context enhancement
CN113343953B (en) FGR-AM method and system for remote sensing scene recognition
Chen et al. Part alignment network for vehicle re-identification
Lou et al. Multi-scale context attention network for image retrieval
Gao et al. Adaptive random down-sampling data augmentation and area attention pooling for low resolution face recognition
CN111582057A (en) Face verification method based on local receptive field
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230221

Address after: Room 401, Floor 4, Building 1, No. 69, Olympic Street, Jianye District, Nanjing, Jiangsu, 210000

Patentee after: Nanjing Zhiqiang Information Technology Co.,Ltd.

Address before: 210000 No. 219 Ningliu Road, Pukou District, Nanjing City, Jiangsu Province

Patentee before: Nanjing University of Information Science and Technology
