CN113361546A - Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism - Google Patents

Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism

Info

Publication number
CN113361546A
CN113361546A
Authority
CN
China
Prior art keywords
remote sensing
feature extraction
sensing image
module
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110679806.8A
Other languages
Chinese (zh)
Inventor
董张玉
张鹏飞
张远南
张晋
安森
于金秋
李金徽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110679806.8A
Publication of CN113361546A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a remote sensing image feature extraction method fusing asymmetric convolution and an attention mechanism, which comprises the following steps: (1) acquiring remote sensing image data of the features to be extracted; (2) generating a first neural network model, wherein the network architecture of the first neural network model adopts ResNet50 comprising five feature extraction modules, the first feature extraction module comprises a convolutional layer, and the second to fifth feature extraction modules each comprise a plurality of residual learning units; (3) adding a mixed-domain attention mechanism module to the first neural network model to obtain a second neural network model; (4) feeding the remote sensing image data into the second neural network model to obtain the features of the remote sensing image. The method enhances the robustness of the model to image flipping and rotation in the data set, and improves the ability of the ResNet50 network to extract target object features from remote sensing images.

Description

Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism
Technical Field
The invention relates to the field of image feature extraction, in particular to a remote sensing image feature extraction method fusing asymmetric convolution and an attention mechanism.
Background
The Deep Residual Network (ResNet) was proposed in 2015. It performs classification through feature extraction, won first place in the ImageNet classification task, and set a new record for CNN models on ImageNet. Intuitively, as the number of network layers increases, the network can extract more complex features; experiments show, however, that accuracy saturates or even degrades as more layers are added. Gradient vanishing and explosion also occur when training deep networks. In the paper "Deep Residual Learning for Image Recognition", Dr. Kaiming He proposed residual learning to solve this problem, using identity mappings to turn several layers of the original network into a residual learning unit (the specific construction is shown in Fig. 1).
In the residual learning unit proposed by Dr. Kaiming He, the calculation formula is as follows:
x_{i+1} = x_i + F(x_i, W_i)  (1)
In equation (1), x_i is the input of the residual learning unit, W_i is the weight of the residual learning unit, F(x_i, W_i) is the residual mapping, and x_{i+1} is the output of the residual learning unit. From this output it can be seen that model performance at least does not degrade as the network deepens; however, the model cannot clearly distinguish the target objects of a remote sensing image during feature extraction, and its ability to extract key features needs further improvement.
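As a concrete illustration, here is a minimal PyTorch sketch of the residual learning unit in equation (1). The two-convolution body standing in for the residual mapping F(x_i, W_i) is an assumption for illustration, not the exact construction of Fig. 1.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Implements x_{i+1} = x_i + F(x_i, W_i) from equation (1)."""
    def __init__(self, channels: int):
        super().__init__()
        # F(x_i, W_i): the residual mapping; the exact layers inside F
        # are an illustrative assumption.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut: the input is added to the residual mapping.
        return torch.relu(x + self.residual(x))
```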
Disclosure of Invention
The invention aims to provide a remote sensing image feature extraction method fusing asymmetric convolution and an attention mechanism, so as to solve the problem in the prior art that target objects of a remote sensing image cannot be clearly distinguished during feature extraction.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the remote sensing image feature extraction method fusing the asymmetric convolution and the attention mechanism comprises the following steps:
(1) acquiring remote sensing image data of the features to be extracted;
(2) generating a first neural network model, wherein the network architecture of the first neural network model adopts ResNet50 comprising five feature extraction modules connected in series: the first feature extraction module comprises a convolutional layer; the second feature extraction module comprises a convolutional layer formed by three residual learning units connected in series; the third feature extraction module comprises a convolutional layer formed by four residual learning units connected in series; the fourth feature extraction module comprises a convolutional layer formed by six residual learning units connected in series; and the fifth feature extraction module comprises a convolutional layer formed by three residual learning units connected in series;
each residual learning unit comprises three convolution kernel sub-units connected in series: the first convolution kernel sub-unit is a convolution kernel of size 1×1; the second convolution kernel sub-unit is formed by connecting three convolution kernels of sizes 3×3, 1×3 and 3×1 in parallel; the third convolution kernel sub-unit is a convolution kernel of size 1×1; in each residual learning unit, the first convolution kernel sub-unit performs dimension compression, the second performs convolution processing, and the third performs dimension recovery, in sequence;
(3) respectively connecting the output of each residual learning unit in the first neural network model generated in step (2) with a mixed-domain attention mechanism module to obtain a second neural network model; the mixed-domain attention mechanism module comprises a feature map extraction sub-module, a fusion sub-module, a decomposition sub-module, a Sigmoid activation function sub-module and a Scale operation sub-module, wherein:
the feature map extraction sub-module extracts feature maps in the horizontal and vertical directions from the output of the corresponding residual learning unit;
the fusion sub-module performs feature fusion on the feature maps in the horizontal and vertical directions to obtain a fusion result;
the decomposition submodule decomposes the fusion result according to the dimension in the horizontal direction and the dimension in the vertical direction to obtain decomposition results in the horizontal direction and the vertical direction;
the Sigmoid activation function submodule carries out activation processing on decomposition results in the horizontal direction and the vertical direction;
the Scale operation sub-module carries out Scale operation on the activation processing result of the Sigmoid activation function sub-module;
(4) feeding the remote sensing image data obtained in step (1) into the second neural network model from step (3); after processing by the second neural network model, the features of the remote sensing image are extracted.
According to the remote sensing image feature extraction method fusing asymmetric convolution and the attention mechanism, in the second convolution kernel sub-unit of each residual learning unit, the outputs of the three parallel convolution kernels are batch-normalized and then summed to serve as the output of the second convolution kernel sub-unit.
In the method for extracting remote sensing image features fusing asymmetric convolution and the attention mechanism, in the mixed-domain attention mechanism module of step (3), the feature map extraction sub-module first decomposes the output of the residual learning unit into one-dimensional feature tensors in the horizontal and vertical directions, then performs a global pooling operation on the two tensors so as to aggregate them along the horizontal and vertical directions respectively, obtaining a one-dimensional feature map in each direction.
In the method for extracting remote sensing image features fusing asymmetric convolution and the attention mechanism, in the mixed-domain attention mechanism module of step (3), the fusion sub-module processes the feature maps in the horizontal and vertical directions with two fully-connected layers and a nonlinear ReLU operation, so that the feature maps in the two directions are fused.
A remote sensing image feature extraction system comprising a processor and a memory, wherein the memory stores program instructions capable of being identified and executed by the processor, and the processor executes the program instructions to execute the remote sensing image feature extraction method according to claim 1.
The remote sensing image feature extraction system is characterized in that the program instruction comprises a first subprogram, a second subprogram and a third subprogram, the step (1) is executed when the processor runs the first subprogram in the program instruction, the steps (2) and (3) are executed when the processor runs the second subprogram in the program instruction, and the step (4) is executed when the processor runs the third subprogram in the program instruction.
Compared with the prior art, the invention has the advantages that:
the invention provides a remote sensing image feature extraction method fusing an asymmetric convolution and an attention mechanism, which takes a ResNet50 network as a basic network framework, adopts asymmetric convolution in a second convolution kernel unit of a residual error learning unit used in the ResNet50 network to obtain the fused convolution, enhances the robustness of a model in image turnover and rotation in a data set, simultaneously fuses a channel attention feature and a space attention feature to provide a mixed domain attention mechanism for obtaining feature position information, and improves the extraction capability of the ResNet50 network on remote sensing image target object features.
Experiments prove that the overall classification accuracy reaches 96.43% on the UCMerced_LandUse data set and 92.71% on the NWPU-RESISC45 data set, a large improvement over the classification effect of the original feature extraction network.
Drawings
Fig. 1 is a schematic diagram of a prior art residual learning unit.
Fig. 2 is a schematic diagram of a ResNet50 residual learning unit.
FIG. 3 is a schematic diagram of the fused 3 × 3 convolution of the present invention.
FIG. 4 is a schematic diagram of a residual learning unit using asymmetric convolution according to the present invention.
FIG. 5 is a schematic diagram of a prior art SE module.
FIG. 6 is a schematic diagram of the structure of SE-ResNet.
FIG. 7 is a schematic diagram of a SCAM module of the present invention.
FIG. 8 is a schematic diagram of the SCAM _ ResNet structure of the present invention.
FIG. 9 is a partial sample of the UCMerced_LandUse data set.
FIG. 10 is a partial sample of a NWPU-RESISC45 data set.
FIG. 11 shows the accuracy and loss values as a function of the number of iterations on the UCMerced_LandUse data set.
Fig. 12 is a comparison of ResNet50 with AC_SCAM_ResNet50 under different training set ratios.
FIG. 13 shows the accuracy and loss values of experiment two as a function of the number of iterations on the NWPU-RESISC45 data set.
FIG. 14 shows the accuracy and loss values of experiment one as a function of the number of iterations on the NWPU-RESISC45 data set.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The invention comprises the following steps:
(1) acquiring remote sensing image data of the features to be extracted;
(2) generating a first neural network model based on a ResNet50 network;
(3) generating a second neural network model based on the first neural network model;
(4) feeding the remote sensing image data obtained in step (1) into the second neural network model from step (3); after processing by the second neural network model, the features of the remote sensing image are extracted, as sketched below.
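The following is a high-level PyTorch sketch of how these four steps might be wired together. The patent publishes no reference code, so build_second_model is a hypothetical stand-in for steps (2) and (3), i.e. constructing ResNet50 with asymmetric-convolution residual units and attaching a SCAM module to the output of each residual learning unit.

```python
import torch
import torch.nn as nn
from torchvision import transforms
from PIL import Image

def extract_features(image_path: str, model: nn.Module) -> torch.Tensor:
    """Steps (1) and (4): load remote sensing image data and feed it
    through the second neural network model to obtain its features."""
    preprocess = transforms.Compose([
        transforms.Resize((256, 256)),  # both data sets use 256x256 images
        transforms.ToTensor(),
    ])
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        return model(x)

# Steps (2) and (3) would construct the model; build_second_model() is
# a hypothetical helper name, not code published with the patent:
# features = extract_features("scene.jpg", build_second_model())
```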
The invention also discloses a remote sensing image feature extraction system, which can be any electronic system, such as a computer or server, provided with a processor and a memory. The memory stores program instructions that the processor can recognize and run; the program instructions comprise a first, a second and a third subprogram. Step (1) is executed when the processor runs the first subprogram, steps (2) and (3) are executed when the processor runs the second subprogram, and step (4) is executed when the processor runs the third subprogram.
In step (2) of the present invention, the first neural network model is obtained mainly by improving the ResNet50 network structure. ResNet50 has five feature extraction modules in total: the first has only one convolutional layer for processing the input; the second comprises a convolutional layer composed of three residual learning units connected in series; the third comprises a convolutional layer composed of four residual learning units connected in series; the fourth comprises a convolutional layer composed of six residual learning units connected in series; and the fifth comprises a convolutional layer composed of three residual learning units connected in series.
Compared with ResNet34, the residual learning units of ResNet50 are changed (as shown in Fig. 2): each residual learning unit is formed by connecting three convolution kernels of sizes 1×1, 3×3 and 1×1 in series, which perform dimension compression, convolution processing and dimension recovery respectively.
Among the properties of convolution is the following: if several two-dimensional kernels of compatible sizes are run on the same input with the same stride to produce outputs of the same size, these kernels can be added at corresponding positions to obtain a single equivalent kernel that produces the same output. Two-dimensional convolution kernels can thus be added even when their sizes differ.
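This additivity can be verified numerically. In the sketch below (random kernel values, purely illustrative), a 1×3 and a 3×1 kernel are embedded into the center row and column of a 3×3 kernel; the summed branch outputs match a single convolution with the equivalent kernel:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)          # toy single-channel input
k3x3 = torch.randn(1, 1, 3, 3)
k1x3 = torch.randn(1, 1, 1, 3)
k3x1 = torch.randn(1, 1, 3, 1)

# Embed the 1x3 and 3x1 kernels at the center row/column of a 3x3
# kernel and add at corresponding positions to get one equivalent kernel.
k_eq = k3x3.clone()
k_eq[:, :, 1:2, :] += k1x3
k_eq[:, :, :, 1:2] += k3x1

out_branches = (F.conv2d(x, k3x3, padding=1)
                + F.conv2d(x, k1x3, padding=(0, 1))
                + F.conv2d(x, k3x1, padding=(1, 0)))
out_equiv = F.conv2d(x, k_eq, padding=1)
print(torch.allclose(out_branches, out_equiv, atol=1e-5))  # True
```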
Building on this, Christian Szegedy proposed in the paper "Rethinking the Inception Architecture for Computer Vision" that any n×n convolution can be replaced by a 1×n convolution followed by an n×1 convolution, which saves parameters and computation but significantly reduces accuracy.
To improve on this, the present invention proposes Asymmetric Convolution. In asymmetric convolution, three parallel kernels — an n×n square kernel, a 1×n horizontal kernel and an n×1 vertical kernel — replace the single n×n convolution; the 1×n and n×1 convolutions enhance the robustness of the model to image flipping and rotation in the data set, and combining the outputs of the different branches improves the quality of the learned representation. Accordingly, the ResNet50 network of the present invention uses asymmetric convolution: the 3×3 convolution kernel in each residual learning unit is replaced by three parallel convolution kernels of sizes 3×3, 1×3 and 3×1, yielding a fused 3×3 convolution, as shown in Fig. 3.
In step (2), as in a standard CNN, each convolution branch in the residual learning unit is followed by a batch normalization operation; the outputs of the three parallel 3×3, 1×3 and 3×1 convolution kernels are batch-normalized and then summed to form the features to be extracted. The residual learning unit using asymmetric convolution is shown in Fig. 4.
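A minimal PyTorch sketch of such an asymmetric-convolution block follows, with the three parallel 3×3, 1×3 and 3×1 convolutions each followed by batch normalization and then summed, as described above. Channel counts and stride handling are illustrative assumptions; in the full residual learning unit this block would take the place of the middle 3×3 convolution of the 1×1 → 3×3 → 1×1 bottleneck.

```python
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """Three parallel branches (3x3, 1x3, 3x1), each batch-normalized,
    summed into one fused 3x3-equivalent output."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.square = nn.Conv2d(in_ch, out_ch, (3, 3), stride, padding=(1, 1), bias=False)
        self.hor = nn.Conv2d(in_ch, out_ch, (1, 3), stride, padding=(0, 1), bias=False)
        self.ver = nn.Conv2d(in_ch, out_ch, (3, 1), stride, padding=(1, 0), bias=False)
        self.bn_square = nn.BatchNorm2d(out_ch)
        self.bn_hor = nn.BatchNorm2d(out_ch)
        self.bn_ver = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch is batch-normalized, then the branches are summed.
        return (self.bn_square(self.square(x))
                + self.bn_hor(self.hor(x))
                + self.bn_ver(self.ver(x)))
```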
The idea of the Attention Mechanism is to let the network learn to attend: to ignore irrelevant information and focus on important information. In essence it weights regions and highlights salient ones. In recent years, research combining deep learning with attention has mostly focused on using masks to form the attention mechanism. The principle of the mask is that key features in the image data are identified through an additional layer of new weights; through training, the deep neural network learns the regions that need attention in each new image, thus forming attention. Attention mechanisms fall into two categories: soft attention and hard attention. Soft attention focuses more on regions or channels; it is deterministic and, once learning is complete, can be generated directly by the network. Hard attention, by contrast, focuses on discrete points; it is a stochastic prediction process that emphasizes dynamic change, and its training is usually completed through reinforcement learning.
Currently popular attention mechanisms include the Channel Attention Module, the Spatial Attention Module and the mixed-domain attention module. Channel attention generates a mask over the channels, producing different weights that represent the relevance of each channel to the key information: the larger the weight, the higher the relevance, i.e. the more that channel needs attention. A representative method is SENet, which learns the correlation among channels by explicitly modeling their interdependence, screens out channel-wise attention and adaptively recalibrates the channel feature responses. Spatial attention generates a mask over space, using the spatial relationships within the feature maps to make the model attend to particular spatial positions. A representative model is the Spatial Transformer Network (STN): its spatial transformer module performs a spatial transformation on the spatial-domain information in the picture so that key information can be extracted; the STN can thus transform the spatial information of the original picture into another space while retaining the key information. Comparing the two: spatial attention ignores the information in the channel domain and treats the picture features in every channel equally, which limits the spatial transformation to the original picture feature extraction stage and gives low interpretability when applied to other layers of the neural network; channel attention directly pools the information within each channel globally, ignoring the local information in each channel, which is likewise a brute-force approach. Mixed-domain attention mechanism models were therefore proposed. A representative one is CBAM (Convolutional Block Attention Module), which applies a channel attention module and a spatial attention module in sequence, learning what to attend to in the channel dimension and where to attend in the spatial dimension, and emphasizing meaningful features in both dimensions.
In SE-ResNet, an SE module is embedded in the ResNet network, and Global Average Pooling (GAP) is used as the Squeeze operation: an input of H × W × C is converted into an output of 1 × 1 × C, giving a global receptive field. For a given input X, the Squeeze operation on its c-th channel is:
z_c = F_{sq}(X_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} x_c(i, j)  (2)
In equation (2), x_c(i, j) is the input image sample, X_c is the channel feature of the input image, W is the feature width, H is the feature height, and F_{sq} denotes the Squeeze operation. In the Excitation operation, two fully-connected layers are used so that the output has the same number of features as the input. The first, W_1, has dimension C/r × C, where r is a scaling parameter that reduces the number of channels and thus the amount of computation; it is followed by a nonlinear ReLU operation that leaves the dimension unchanged. The second fully-connected layer, W_2, has dimension C × C/r, so the output dimension is 1 × 1 × C. The output obtained through the Sigmoid activation function is:
s = F_{ex}(z_c, W) = \sigma(W_2 \mathrm{ReLU}(W_1 z_c))  (3)
In equation (3), \sigma is the Sigmoid activation function, W_1 is the weight of the first fully-connected layer, W_2 is the weight of the second fully-connected layer, z_c is the output of the Squeeze operation, and F_{ex} denotes the Excitation operation.
The output of the Excitation operation represents the weights of the feature maps in the input X; since these weights are derived from the preceding fully-connected and ReLU operations, they can be trained end-to-end. The final Scale operation multiplies each channel by its weight:
F_{scale}(x_c, s) = x_c \cdot s  (4)
In equation (4), x_c is the original input image sample and s is the channel weight; F_{scale} is the output after the Scale operation. The SE module is shown in Fig. 5. The flexibility of the SE module allows it to be embedded into Inception or ResNet; embedding it into the ResNet network yields the SE-ResNet structure shown in Fig. 6.
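For reference, a minimal PyTorch sketch of an SE module matching equations (2)-(4) follows; the reduction ratio r = 16 is a common default and an assumption here, not a value stated in the text.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation block following equations (2)-(4)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # eq. (2): H x W x C -> 1 x 1 x C
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: C/r x C
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2: C x C/r
            nn.Sigmoid(),                                # eq. (3)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)       # squeeze to per-channel scalars
        s = self.excite(z).view(b, c, 1, 1)  # per-channel weights
        return x * s                         # eq. (4): the Scale operation
```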
Although SE-ResNet re-weighs the importance of each channel by modeling channel relationships, it ignores position information. The present method therefore improves on it in step (3): a new mixed-domain attention mechanism module, SCAM (Split and Concat Attention mechanism), is proposed on the basis of the SE module, yielding the second neural network model. SCAM fuses the precise position information of the feature map into the channel attention, which enhances the representation of features, localizes and identifies them more accurately, and obtains the precise position information of the features to be attended to.
In step (3), the mixed-domain attention mechanism module (hereinafter SCAM) decomposes the output of the corresponding residual learning unit from step (2) into two one-dimensional feature tensors; after global pooling, the input information is aggregated along the horizontal and vertical directions respectively, giving feature maps in the two directions. The output in the horizontal direction is:
z_c(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)  (5)
where x_c(h, i) is the sample object of the input image mapped in the horizontal direction. The output in the vertical direction is:
z_c(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)  (6)
In equations (5) and (6), x_c(j, w) is the sample object of the input image mapped in the vertical direction, W is the feature width (horizontal direction), and H is the feature height (vertical direction).
Then the feature maps in the two directions are fused using, as in the SE module, two fully-connected layers and a nonlinear ReLU operation; the output is:
F_{fc} = W_2 \mathrm{ReLU}(W_1 [z_c(h), z_c(w)])  (7)
In equation (7), [z_c(h), z_c(w)] denotes the feature fusion (concatenation) operation, W_1 is the weight of the first fully-connected layer, and W_2 is the weight of the second fully-connected layer.
Then the fused feature map is decomposed along the horizontal dimension W_h (of size C × H) and the vertical dimension W_w (of size W × C), and after activation by the Sigmoid function:
s_w = \sigma(W_w F_{fc})  (8)
s_h = \sigma(W_h F_{fc})  (9)
In equations (8) and (9), \sigma is the Sigmoid activation function, s_w is the vertical-direction output after decomposition of the fused feature map, s_h is the horizontal-direction output after decomposition of the fused feature map, W_w is the vertical-direction feature dimension, and W_h is the horizontal-direction feature dimension.
The final Scale operation then outputs:
F_{scale}(x_c, s_w, s_h) = x_c \cdot s_w \cdot s_h  (10)
In equation (10), x_c is the original input image sample, s_w is the vertical-direction output after decomposition of the fused feature map, and s_h is the horizontal-direction output. The SCAM module is shown in Fig. 7, and the SCAM_ResNet structure obtained by embedding it into the ResNet network is shown in Fig. 8.
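As a concrete reference, here is a minimal PyTorch sketch of a SCAM-style module following equations (5)-(10). The 1×1 convolutions standing in for the fully-connected weights W_1, W_2, W_h and W_w, and the reduction ratio r = 16, are implementation assumptions; the patent publishes no reference code.

```python
import torch
import torch.nn as nn

class SCAM(nn.Module):
    """Mixed-domain attention sketch: pool along H and W separately,
    fuse (concat + two FC layers with ReLU), split back into the two
    directions, apply Sigmoid, and rescale the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # eq. (5): average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # eq. (6): average over height
        self.fuse = nn.Sequential(                     # eq. (7): W2 ReLU(W1 [.])
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )
        self.fc_h = nn.Conv2d(channels, channels, kernel_size=1)  # W_h in eq. (9)
        self.fc_w = nn.Conv2d(channels, channels, kernel_size=1)  # W_w in eq. (8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                          # (b, c, h, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)      # (b, c, w, 1)
        f = self.fuse(torch.cat([z_h, z_w], dim=2))   # concat then fuse
        f_h, f_w = torch.split(f, [h, w], dim=2)      # decompose by direction
        s_h = torch.sigmoid(self.fc_h(f_h))                       # eq. (9)
        s_w = torch.sigmoid(self.fc_w(f_w.permute(0, 1, 3, 2)))   # eq. (8)
        return x * s_h * s_w                          # eq. (10)

# Usage: attach after a residual learning unit's output, e.g.
# y = SCAM(256)(torch.randn(1, 256, 56, 56))
```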
The experiment and result analysis of the invention are as follows:
1. data set
The experimental data sets are the two large remote sensing image scene classification data sets UCMerced_LandUse and NWPU-RESISC45. The data in UCMerced_LandUse are selected from aerial imagery in the national urban area maps of the United States Geological Survey and comprise 21 scene classes such as farmland, residential areas, forests and oil tanks; each class consists of 100 color images of about 0.3 m resolution and 256 × 256 pixels, 2100 images in total. Fig. 9 shows partial samples of the data set.
The NWPU-RESISC45 data set is a benchmark for remote sensing image scene classification created by Northwestern Polytechnical University. On the basis of the UCMerced_LandUse data set it adds more detailed scenes such as islands, ships, churches and power stations, covering 45 scene categories; each category consists of 700 images of 256 × 256 pixels with resolutions from 30 m to 0.2 m, 31500 images in total. Fig. 10 shows partial samples of the data set. Compared with UCMerced_LandUse, NWPU-RESISC45 has more complex scenes and is more difficult and challenging to classify. As can be seen from Fig. 10, the high-resolution remote sensing images cover diverse scene categories; scene images of different categories can be highly similar, while images of the same category can differ greatly. For example, residential areas are divided into sparse, medium and dense according to building density; images of the same category such as forests and rivers differ greatly in color and texture; and scene categories such as airplanes and oil tanks include both single-target and multi-target images. All of this increases the difficulty of classification.
2. Experimental setup
The test platform uses an Intel(R) i7-7700HK processor and 32 GB of memory, accelerated by an NVIDIA Tesla P100 with 16 GB of video memory; the deep learning framework is PyTorch 1.4. During training, the Adam optimizer is used with the learning rate set to 3e-4, the loss function is the cross-entropy loss, and the number of training iterations is 400, to ensure the reliability of the model. The loss and accuracy curves in the experimental results are drawn from data obtained through TensorBoard visualization in order to analyze network convergence.
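A minimal training-loop sketch matching this configuration (Adam, learning rate 3e-4, cross-entropy loss, 400 iterations, here interpreted as epochs) follows; the stock torchvision ResNet50 and the dummy data are stand-ins for the patent's AC_SCAM_ResNet50 and the real data sets, which are not published with the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet50

# Stand-in model: stock ResNet50 with a 21-class head (UCMerced_LandUse).
model = resnet50(num_classes=21)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # Adam, lr = 3e-4
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss

# Dummy data so the sketch is self-contained; replace with the real loader.
loader = DataLoader(TensorDataset(torch.randn(8, 3, 256, 256),
                                  torch.randint(0, 21, (8,))), batch_size=4)

model.train()
for epoch in range(400):  # 400 training iterations as stated above
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```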
In order to verify the effectiveness of the proposed algorithm and to allow comparison with results from other papers, the same data set partitioning method is adopted. Experiments are performed on the UCMerced_LandUse and NWPU-RESISC45 data sets respectively. On the UCMerced_LandUse data set, 80% of the images are randomly selected as the training set and the remaining 20% as the test set. Because the NWPU-RESISC45 data set is huge, two experiments are set up: in experiment one, 10% of the scene images of each class are randomly taken as the training set and the remaining 90% as the test set; in experiment two, 20% of the scene images of each class are randomly taken as the training set and the remaining 80% as the test set.
3. Results and analysis of the experiments
3.1 ResNet series comparative experiments
Comparative experiments were performed on the UCMerced_LandUse and NWPU-RESISC45 data sets for ResNet34, ResNet50, ResNeSt, SENet-ResNet, AC_ResNet50, SCAM_ResNet50 and AC_SCAM_ResNet50. The results are shown in Table 1:
table 1 experimental results of ResNet series networks in two large data sets
(Table 1 appears as an image in the original publication.)
On the NWPU-RESISC45 data set, the accuracy of ResNet34 in both experiments lags far behind that of the other networks, because its shallow depth gives it weak feature extraction ability. AC_ResNet50 improves on the original ResNet50 by 0.42% and 0.54% respectively, indicating that the asymmetric decomposition of the 3×3 convolution kernel is effective. ResNeSt and SENet-ResNet add attention mechanisms on the basis of ResNet, and their feature extraction on remote sensing images is superior to the original ResNet; SCAM_ResNet50 improves on ResNeSt by 0.96% and 0.68% and on SENet-ResNet by 0.63% and 0.46% respectively, which shows that adding spatial position information on the basis of the SE module is effective: obtaining the precise position information of features along with their channel information makes feature extraction more comprehensive. After the two improved modules are fused, the feature extraction ability of the AC_SCAM_ResNet50 network improves the most, reaching classification accuracies of 90.4% and 92.71%, improvements of 2.68% and 2.17% respectively over the original ResNet50.
In the experiments on the UCMerced_LandUse data set, the accuracy of ResNet34 does not lag far behind that of ResNet50, because the UCMerced_LandUse data set is small and the influence of network depth on feature extraction ability is limited; deepening the network cannot improve classification accuracy and may even cause overfitting. ResNet50 is therefore chosen for the subsequent experiments. AC_ResNet50 improves on the original ResNet50 by 1.14%, showing that the improved asymmetric convolution brings a larger gain when classifying small data sets: it fuses high- and low-level features and thus improves the feature extraction ability of the network. SCAM_ResNet50 improves on ResNeSt and SENet-ResNet by 1.42% and 0.85% respectively, which fully shows that the mixed attention mechanism is more effective than a single attention mechanism and has stronger feature extraction ability. AC_SCAM_ResNet50 improves on the original ResNet50 by 3.19%, so the improvements of the present method are effective.
3.2 mainstream network contrast experiment
For better comparison, mainstream networks in the deep learning field and results from the literature on the same data sets are added to the experiments. Under the UCMerced_LandUse data set, the classical AlexNet and VGGNet are added, as well as SegNet, U-Net and DeepLabV2, together with methods from the literature on the same data set such as GBRCNN and SE-VGG16. The classification results are shown in the following table:
TABLE 2 Experimental results of mainstream networks on the UCMerced_LandUse data set
(Table 2 appears as an image in the original publication.)
As can be seen from Table 2, AC_SCAM_ResNet50 improves classification accuracy by 4.07% and 3.21% over the classical networks AlexNet and VGGNet respectively, by 3.81%, 3.19% and 3.05% over SegNet, U-Net and DeepLabV2, and by 1.9% and 6.98% over GBRCNN and SE-VGG16.
Experiment two is carried out under the NWPU-RESISC45 data set: 20% of the images of each scene class are randomly selected as the training set and the remaining 80% as the test set. DenseNet, FCN, DeepLabV3 and Attention U-Net are added, as well as methods from the literature on the same data set such as ECNN and ResNet101-CBAM. The classification results are shown in the following table:
TABLE 3 Experimental results for mainstream networks in the NWPU-RESISC45 dataset
(Table 3 appears as an image in the original publication.)
From the results in Table 3, AC_SCAM_ResNet50 improves classification accuracy by 2.55% and 3.02% over the classical networks DenseNet and FCN, by 2.32% and 1.17% over Attention U-Net and DeepLabV3, and by 2.78% and 0.21% over ECNN and ResNet101-CBAM, respectively.
The comparison with mainstream networks shows that ResNet50 combined with the improved asymmetric convolution and with the attention mechanism that acquires precise feature position information greatly enhances the extraction of remote sensing image features, strengthens feature representation and achieves better scene classification.
3.3 analysis of results
To better analyze the effectiveness of the present method for remote sensing image scene classification, Figs. 11, 13 and 14 show how accuracy and loss values vary with the number of iterations on the two experimental data sets.
Fig. 11 plots the accuracy and loss values on the UCMerced_LandUse data set against the number of iterations. Observing the accuracy curve, the classification accuracy stabilizes after about 128 iterations and remains at about 93%, while the training loss also stabilizes; the accuracy reaches 96.43% after 387 iterations.
Since the UCMerced_LandUse data set has few categories and a small total size, experiments were performed on ResNet50 and AC_SCAM_ResNet50 with training sets of different proportions to compare test accuracy; the results are plotted in Fig. 12. As can be seen from Fig. 12, the test accuracy of the improved ResNet50 increases with the size of the training set; the jump from a 10% to a 20% training ratio is particularly obvious, with accuracy rising from 69% to 81.37%. With a 50% training ratio the accuracy of AC_SCAM_ResNet50 reaches 91.34%, and it peaks at an 80% training ratio; when the ratio is increased further to 90%, the accuracy drops by 1.91%. The same phenomenon appears in the ResNet50 experiment, because overfitting occurs and too few samples in the test set introduce interference.
Fig. 13 plots the accuracy and loss values of experiment two on the NWPU-RESISC45 data set against the number of iterations. Observing the accuracy curve, the classification accuracy stabilizes after about 99 iterations and remains at about 91%, while the training loss also stabilizes; the accuracy reaches 92.71% after 368 iterations.
Fig. 14 plots the accuracy and loss values of experiment one on the NWPU-RESISC45 data set against the number of iterations. Observing the accuracy curve, the classification accuracy stabilizes after about 60 iterations and remains at about 88%, while the training loss also stabilizes; the best accuracy, 90.4%, is reached after 360 iterations.
The ResNet50 improved by the present method performs well on the test sets of both data sets, so the ResNet network fusing the improved asymmetric convolution and attention mechanism is effective for remote sensing image scene classification.
The embodiments described above are only preferred embodiments of the present invention and are not intended to limit its concept and scope. Various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from its design concept shall fall within its protection scope; the claimed technical content of the present invention is fully set forth in the claims.

Claims (6)

1. A remote sensing image feature extraction method fusing asymmetric convolution and an attention mechanism, characterized by comprising the following steps:
(1) acquiring remote sensing image data of the features to be extracted;
(2) generating a first neural network model, wherein the network architecture of the first neural network model adopts ResNet50 comprising five feature extraction modules connected in series: the first feature extraction module comprises a convolutional layer; the second feature extraction module comprises a convolutional layer formed by three residual learning units connected in series; the third feature extraction module comprises a convolutional layer formed by four residual learning units connected in series; the fourth feature extraction module comprises a convolutional layer formed by six residual learning units connected in series; and the fifth feature extraction module comprises a convolutional layer formed by three residual learning units connected in series;
each residual learning unit comprises three convolution kernel sub-units connected in series: the first convolution kernel sub-unit is a convolution kernel of size 1×1; the second convolution kernel sub-unit is formed by connecting three convolution kernels of sizes 3×3, 1×3 and 3×1 in parallel; the third convolution kernel sub-unit is a convolution kernel of size 1×1; in each residual learning unit, the first convolution kernel sub-unit performs dimension compression, the second performs convolution processing, and the third performs dimension recovery, in sequence;
(3) respectively connecting the output of each residual learning unit in the first neural network model generated in step (2) with a mixed-domain attention mechanism module to obtain a second neural network model; the mixed-domain attention mechanism module comprises a feature map extraction sub-module, a fusion sub-module, a decomposition sub-module, a Sigmoid activation function sub-module and a Scale operation sub-module, wherein:
the feature map extraction sub-module extracts feature maps in the horizontal and vertical directions from the output of the corresponding residual learning unit;
the fusion sub-module performs feature fusion on the feature maps in the horizontal and vertical directions to obtain a fusion result;
the decomposition submodule decomposes the fusion result according to the dimension in the horizontal direction and the dimension in the vertical direction to obtain decomposition results in the horizontal direction and the vertical direction;
the Sigmoid activation function submodule carries out activation processing on decomposition results in the horizontal direction and the vertical direction;
the Scale operation sub-module carries out Scale operation on the activation processing result of the Sigmoid activation function sub-module;
(4) feeding the remote sensing image data obtained in step (1) into the second neural network model from step (3); after processing by the second neural network model, the features of the remote sensing image are extracted.
2. The remote sensing image feature extraction method fusing asymmetric convolution and the attention mechanism according to claim 1, characterized in that in the second convolution kernel sub-unit of each residual learning unit, the outputs of the three parallel convolution kernels are batch-normalized and then summed to serve as the output of the second convolution kernel sub-unit.
3. The remote sensing image feature extraction method fusing asymmetric convolution and the attention mechanism according to claim 1, characterized in that in the mixed-domain attention mechanism module of step (3), the feature map extraction sub-module decomposes the output of the residual learning unit into one-dimensional feature tensors in the horizontal and vertical directions, performs a global pooling operation on the two tensors, and aggregates them along the horizontal and vertical directions respectively to obtain one-dimensional feature maps in the corresponding directions.
4. The remote sensing image feature extraction method fusing asymmetric convolution and the attention mechanism according to claim 1, characterized in that in the mixed-domain attention mechanism module of step (3), the fusion sub-module processes the feature maps in the horizontal and vertical directions with two fully-connected layers and a nonlinear ReLU operation, so that the feature maps in the two directions are fused.
5. A remote sensing image feature extraction system comprising a processor and a memory, wherein the memory stores program instructions capable of being recognized and executed by the processor, and wherein the processor executes the program instructions to perform the remote sensing image feature extraction method of claim 1.
6. The remote sensing image feature extraction system according to claim 5, wherein the program instructions comprise a first subprogram, a second subprogram and a third subprogram, the step (1) is executed when the processor runs the first subprogram in the program instructions, the steps (2) and (3) are executed when the processor runs the second subprogram in the program instructions, and the step (4) is executed when the processor runs the third subprogram in the program instructions.
CN202110679806.8A 2021-06-18 2021-06-18 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism Pending CN113361546A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110679806.8A CN113361546A (en) 2021-06-18 2021-06-18 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110679806.8A CN113361546A (en) 2021-06-18 2021-06-18 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism

Publications (1)

Publication Number Publication Date
CN113361546A (en) 2021-09-07

Family

ID=77535146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110679806.8A Pending CN113361546A (en) 2021-06-18 2021-06-18 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism

Country Status (1)

Country Link
CN (1) CN113361546A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838064A (en) * 2021-09-23 2021-12-24 哈尔滨工程大学 Cloud removing method using multi-temporal remote sensing data based on branch GAN
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN114723760A (en) * 2022-05-19 2022-07-08 北京世纪好未来教育科技有限公司 Portrait segmentation model training method and device and portrait segmentation method and device
CN116543216A (en) * 2023-05-10 2023-08-04 北京建筑大学 Fine granularity image classification optimization method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726748A (en) * 2018-12-21 2019-05-07 长沙理工大学 A kind of GL-CNN remote sensing images scene classification method based on frequency band feature fusion
CN110674741A (en) * 2019-09-24 2020-01-10 广西师范大学 Machine vision gesture recognition method based on dual-channel feature fusion
CN110728224A (en) * 2019-10-08 2020-01-24 西安电子科技大学 Remote sensing image classification method based on attention mechanism depth Contourlet network
CN111523521A (en) * 2020-06-18 2020-08-11 西安电子科技大学 Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN111754988A (en) * 2020-06-23 2020-10-09 南京工程学院 Sound scene classification method based on attention mechanism and double-path depth residual error network
CN111767800A (en) * 2020-06-02 2020-10-13 华南师范大学 Remote sensing image scene classification score fusion method, system, equipment and storage medium
CN112017116A (en) * 2020-07-23 2020-12-01 西北大学 Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN112767251A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Image super-resolution method based on multi-scale detail feature fusion neural network
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726748A (en) * 2018-12-21 2019-05-07 长沙理工大学 A kind of GL-CNN remote sensing images scene classification method based on frequency band feature fusion
CN110674741A (en) * 2019-09-24 2020-01-10 广西师范大学 Machine vision gesture recognition method based on dual-channel feature fusion
CN110728224A (en) * 2019-10-08 2020-01-24 西安电子科技大学 Remote sensing image classification method based on attention mechanism depth Contourlet network
CN111767800A (en) * 2020-06-02 2020-10-13 华南师范大学 Remote sensing image scene classification score fusion method, system, equipment and storage medium
CN111523521A (en) * 2020-06-18 2020-08-11 西安电子科技大学 Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN111754988A (en) * 2020-06-23 2020-10-09 南京工程学院 Sound scene classification method based on attention mechanism and double-path depth residual error network
CN112017116A (en) * 2020-07-23 2020-12-01 西北大学 Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN112767251A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Image super-resolution method based on multi-scale detail feature fusion neural network
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DONGHANG YU 等: "Hierarchical Attention and Bilinear Fusion for Remote Sensing Image Scene Classification", 《IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING》 *
QIBIN HOU 等: "Coordinate Attention for Efficient Mobile Network Design", 《ARXIV:2103.02907V1》 *
RUI LI 等: "MACU-Net for Semantic Segmentation of Fine-Resolution Remotely Sensed Images", 《IEEE GEOSCIENCE AND REMOTE SENSING LETTERS》 *
张晋: "Scene Image Recognition Based on Local Perception", China Master's Theses Full-text Database, Information Science and Technology *
李红艳 等: "Remote sensing image object detection with a convolutional neural network improved by an attention mechanism", Journal of Image and Graphics *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838064A (en) * 2021-09-23 2021-12-24 哈尔滨工程大学 Cloud removing method using multi-temporal remote sensing data based on branch GAN
CN113838064B (en) * 2021-09-23 2023-12-22 哈尔滨工程大学 Cloud removal method based on branch GAN using multi-temporal remote sensing data
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN114723760A (en) * 2022-05-19 2022-07-08 北京世纪好未来教育科技有限公司 Portrait segmentation model training method and device and portrait segmentation method and device
CN114723760B (en) * 2022-05-19 2022-08-23 北京世纪好未来教育科技有限公司 Portrait segmentation model training method and device and portrait segmentation method and device
CN116543216A (en) * 2023-05-10 2023-08-04 北京建筑大学 Fine granularity image classification optimization method and system

Similar Documents

Publication Publication Date Title
CN113361546A (en) Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism
Chen et al. The face image super-resolution algorithm based on combined representation learning
CN111199214B (en) Residual network multispectral image ground object classification method
Huang et al. Multiple attention Siamese network for high-resolution image change detection
Chen et al. SNIS: A signal noise separation-based network for post-processed image forgery detection
CN112836637B (en) Pedestrian re-identification method based on space reverse attention network
CN112257741B (en) Method for detecting generative anti-false picture based on complex neural network
CN113468996A (en) Camouflage object detection method based on edge refinement
Liu et al. A super resolution algorithm based on attention mechanism and srgan network
CN115546032A (en) Single-frame image super-resolution method based on feature fusion and attention mechanism
CN114612476A (en) Image tampering detection method based on full-resolution hybrid attention mechanism
Chen et al. Geo-defakehop: High-performance geographic fake image detection
CN113191390A (en) Image classification model construction method, image classification method and storage medium
CN117095287A (en) Remote sensing image change detection method based on space-time interaction transducer model
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
Chen et al. Intra-and inter-reasoning graph convolutional network for saliency prediction on 360° images
CN113554653A (en) Semantic segmentation method for long-tail distribution of point cloud data based on mutual information calibration
CN111539434B (en) Infrared weak and small target detection method based on similarity
CN111985487A (en) Remote sensing image target extraction method, electronic equipment and storage medium
CN116311434A (en) Face counterfeiting detection method and device, electronic equipment and storage medium
Chen et al. A robust object segmentation network for underwater scenes
CN113344110B (en) Fuzzy image classification method based on super-resolution reconstruction
Gan et al. Highly accurate end-to-end image steganalysis based on auxiliary information and attention mechanism
BanTeng et al. Channel-wise dense connection graph convolutional network for skeleton-based action recognition
Nguyen et al. A novel multi-branch wavelet neural network for sparse representation based object classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20210907