CN114359613A - Remote sensing image scene classification method based on space and multi-channel fusion self-attention network - Google Patents
Abstract
The invention relates to the technical field of image processing, and provides a remote sensing image scene classification method based on a spatial and multi-channel fusion self-attention mechanism, comprising the following steps: a. extracting residual feature information of the remote sensing image with a ResNet network; b. performing feature mapping on the foreground and background content with a spatial self-attention network to obtain spatial mapping features; c. performing multi-channel, multi-scale feature mapping on the residual feature information with a multi-channel fusion self-attention network to obtain multi-channel fusion mapping features; d. synthesizing the extracted spatial mapping features and the multi-channel fusion mapping features; e. classifying the synthesized mapping features with a width classifier to obtain the classification result. The remote sensing image scene classification method based on spatial and multi-channel fusion can effectively improve the average precision on remote sensing scene classification data sets while reducing computational cost.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a scene classification method for remote sensing images.
Background
The purpose of remote sensing image scene classification is to assign a specific semantic class to a remote sensing image. The technology has attracted attention owing to its application potential in fields such as urban monitoring, environment detection, and geographic structure analysis. The network framework for these tasks generally consists of two basic networks: a feature mapping network and a class classification network.
In recent years, the strong feature learning capability of deep convolutional neural networks (CNNs) has remarkably improved the accuracy of remote sensing image scene classification. Existing methods for this task can be roughly divided into two types: raw-CNN methods and feature mapping methods. The first type classifies scenes directly with a standard CNN, extracting features only from the last layer of the deep convolutional neural network. Feature mapping methods instead first encode these features to improve scene classification performance. Indeed, previous work on remote sensing scene classification has shown that semantic features extracted from different layers of the hierarchy significantly improve classification accuracy. In addition, complicated image background information poses a considerable obstacle, and spatial information can significantly help with the object detection problem. However, existing approaches typically ignore both the multi-layer structure and spatial region information.
Disclosure of Invention
Aiming at the defects of prior remote sensing image scene classification methods, the invention provides an end-to-end remote sensing image scene classification method that integrates the multi-layer feature structure with spatial information. In addition, the method introduces a new classifier, the width classifier, to identify the class label of a remote sensing image. First, for the misrecognition caused by incomplete feature representation, a new module is provided: the multi-channel fusion self-attention module, which strengthens the relationships between feature mappings across scales. Second, for the detection errors caused by complex backgrounds, a spatial self-attention mechanism is added to the original feature mapping network. In the final stage, the width classifier is trained on the remapped feature information. Because the width classifier is built on a broad learning system (BLS), it effectively reduces training time while maintaining classification performance.
In order to achieve the purpose, the remote sensing image scene classification method based on the spatial and multi-channel fusion self-attention network comprises the following steps:
step S1: providing a remote sensing image to be classified, preprocessing the remote sensing image and acquiring corresponding mapping characteristics.
Step S2: building a spatial self-attention network, wherein the network comprises a convolutional layer and an excitation layer; and inputting the mapping characteristics into a spatial self-attention network to obtain spatial mapping characteristics.
Step S3: building a multi-channel fusion self-attention network, wherein the network comprises an upper sampling layer, a connecting layer, a fusion layer, a splitting layer, a lower sampling layer and a convolution layer; and inputting the mapping characteristics into a multi-channel fusion self-attention network to obtain the multi-channel fusion characteristics.
Step S4: and synthesizing the spatial mapping feature and the multi-channel fusion feature.
Step S5: building a width classifier, wherein the classifier comprises a width feature mapping layer and a width feature enhancement layer; and inputting the synthesized spatial mapping features and multi-channel fusion features into the width classifier to obtain the classification result.
Optionally, in an embodiment of the invention, the input of step S1 is an H × W remote sensing image (H and W respectively denote the height and width of the image), and the multi-layer residual mapping features of the image are extracted through a ResNet network, denoted R-Conv-1~4.
Optionally, in an embodiment of the present invention, the input of step S2 is the first-layer residual mapping feature R-Conv-1, and the spatial self-attention weight is obtained by a convolution operation followed by an activation function. The step can be expressed as:
S = sigmoid(A_s F) * F
wherein S is the output spatial self-attention mapping feature, sigmoid is the sigmoid activation function, A_s is the convolution kernel of the convolution layer, and F ∈ R^(H×W×C) (C is the feature dimension) is the input residual mapping feature.
Optionally, in an embodiment of the present invention, the step S3 includes:
Step S31: the input is the layer 2~4 residual mapping features extracted from ResNet, denoted R-Conv-2~4; R-Conv-3~4 are upsampled to the same size as R-Conv-2.
Step S32: inputting the aligned mapping feature map into the connection layer, and forming a complementary channel fusion feature through the fusion layer.
Step S33: inputting the channel fusion characteristics into a segmentation layer, segmenting the complementary fusion characteristics, and restoring the complementary fusion characteristics to the original size.
Optionally, in an embodiment of the present invention, the synthesis method in step S4 is as follows:
F′ = F + sigmoid((W_s * F) * (W_c * F))
wherein F′ is the synthesized feature, W_s is the spatial self-attention network weight, W_c is the multi-channel fusion self-attention network weight, and * denotes matrix multiplication.
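As a rough illustration, the synthesis step can be sketched in NumPy. Here the branch weights W_s and W_c are modeled as 1×1 channel-mixing matrices and the product of the two branches is taken element-wise; this is an assumed reading of the formula, not the patent's exact operator:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def synthesize(feat, w_s, w_c):
    """F' = F + sigmoid((W_s F) * (W_c F)) on an (H, W, C) feature map.
    The branch weights are modeled as 1x1 convolutions (channel-mixing
    matrices); an illustrative reading of the patent's formula."""
    spatial = feat @ w_s   # spatial self-attention branch
    channel = feat @ w_c   # multi-channel fusion branch
    return feat + sigmoid(spatial * channel)  # residual combination

rng = np.random.default_rng(1)
F = rng.standard_normal((8, 8, 4))        # toy residual feature map
W_s = rng.standard_normal((4, 4)) * 0.1   # hypothetical branch weights
W_c = rng.standard_normal((4, 4)) * 0.1
F_prime = synthesize(F, W_s, W_c)
print(F_prime.shape)  # (8, 8, 4)
```

Because the sigmoid gate lies strictly in (0, 1), the synthesized feature is always a bounded residual offset of the input, which keeps the original mapping intact while injecting the attention signal.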
Optionally, in an embodiment of the present invention, the step S5 includes:
S51: the input of the width classifier is the synthesized feature F′, and the corresponding width mapping features can be expressed as:
M_i = φ(F′ W_si + β_si), i = 1, 2, ..., n
wherein M_i is the i-th width mapping feature, φ is the activation function, W_si is the width mapping convolution kernel, β_si is a randomly generated bias term, and n is the total number of feature points. Thus, all width mapping features can be denoted as M^n = [M_1, M_2, ..., M_n].
S52: the input of the width feature enhancement layer is the width mapping features, and the obtained m-th group of width enhancement features satisfies the following expression:
Hm=σ(MnWhm+βhm)
wherein H_m is the m-th group of width enhancement features, σ is the activation function, W_hm is the width enhancement convolution kernel, and β_hm is a randomly generated bias term.
S53: the output classification result can be expressed as:
Y = [M_1, M_2, ..., M_n | σ(M^n W_h1 + β_h1), ..., σ(M^n W_hm + β_hm)] W^m
wherein W^m represents the connection weights of the width mapping nodes and the width enhancement nodes.
The remote sensing image scene classification method provided by the invention comprises a spatial self-attention network, a multi-channel fusion self-attention network, and a width classifier. The core idea of the method is to fully exploit the relationships between different layers and the spatial information of the target object. Compared with the prior art, the method has at least the following advantages:
1) the invention adopts the multi-channel fusion self-attention network, fully utilizes the relation among different network layers and improves the classification accuracy.
2) The invention adopts the space self-attention network, and can effectively extract useful information from the complex background image.
3) The invention adopts the width classifier, a flat network that does not require a complex back-propagation process during training, which significantly accelerates training and reduces training time.
Drawings
FIG. 1 is a flow chart of a remote sensing image scene classification method based on a spatial and multi-channel fusion self-attention network;
FIG. 2 shows a network design block diagram of the remote sensing image scene classification method based on the spatial and multi-channel fusion self-attention network;
FIG. 3 is a schematic diagram comparing the remote sensing image scene classification of the method of the present invention with that of other methods.
Detailed description of the invention
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1, the method of an embodiment of the present invention operates as follows: s1, extracting mapping characteristics of a remote sensing image; s2, extracting spatial features; s3, extracting multi-channel fusion characteristics; s4, synthesizing spatial features and multi-channel fusion features; and S5, classifying the remote sensing image scene.
For step S1, the invention employs a ResNet network to extract a multi-scale feature map. The structure of the proposed method is shown in FIG. 1. Let X ∈ R^(H×W) denote the input image (H and W respectively denote the height and width of the image); the residual feature map learned by the ResNet network is denoted F ∈ R^(H×W×C).
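The residual mapping that ResNet learns can be illustrated with a minimal identity-shortcut block; the 1×1 channel-mixing weights below are toy stand-ins for the real convolution layers, not the patent's architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + W2 relu(W1 x): the identity shortcut that defines a
    residual mapping. 1x1 channel-mixing matrices stand in for the
    real 3x3 convolutions of a ResNet block."""
    return x + relu(x @ w1) @ w2

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 8, 4))        # toy input image features
w1 = rng.standard_normal((4, 4)) * 0.1    # hypothetical block weights
w2 = rng.standard_normal((4, 4)) * 0.1
F = residual_block(X, w1, w2)
print(F.shape)  # (8, 8, 4)
```

The shortcut means the block only has to learn the residual correction to its input, which is what makes features from layers 1 through 4 usable as the complementary R-Conv-1~4 maps later in the method.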
For step S2, the invention proposes a spatial self-attention mechanism. As shown in FIG. 1, the input of the spatial self-attention module is the feature map F ∈ R^(H×W×C), and the output is S ∈ R^(H×W×C). The spatial self-attention mechanism can be expressed as:
S = sigmoid(A_s F) * F
wherein sigmoid is the sigmoid activation function and A_s is the convolution kernel of the spatial self-attention module.
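A minimal NumPy sketch of this gating step, under the assumption that A_s acts as a 1×1 convolution (the patent only specifies a convolution operation connected to an activation function):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_self_attention(feat, a_s):
    """S = sigmoid(A_s F) * F: a learned mask in (0, 1) gates every
    spatial position of F. A_s is modeled as a 1x1 convolution, i.e.
    a C x C channel-mixing matrix (an assumed form)."""
    mask = sigmoid(feat @ a_s)   # same (H, W, C) shape as the input
    return mask * feat           # element-wise re-weighting

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 4))        # toy R-Conv-1 feature map
A_s = rng.standard_normal((4, 4)) * 0.1   # hypothetical 1x1 kernel
S = spatial_self_attention(F, A_s)
print(S.shape)  # (8, 8, 4)
```

Since the mask is bounded by 1, attention can only suppress background responses, never amplify them, which is the intended effect on complex-background images.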
For step S3, the feature map reflects the semantic and structural information of the remote sensing image, which is of great significance for visual classification and recognition tasks. Furthermore, feature maps from different levels tend to carry different characteristics. The invention therefore provides a multi-channel fusion self-attention network that integrates feature mappings from different layers to strengthen the discrimination of scene types. As shown in FIG. 2, the residual features extracted by ResNet are denoted R-Conv-1~4, and R-Conv-3~4 are upsampled to the same size as R-Conv-2. The aligned feature maps are then concatenated into new channel mapping information, and enhancement is performed across all layers. The fused block is then split and restored to the original sizes. In effect, this operation strengthens the complementary relationships between feature mappings at these different levels.
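The align, concatenate, fuse, split, restore pipeline can be sketched as follows; the nearest-neighbour resizing and the simple fusion gate are illustrative choices, since the patent does not fix those operators:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def resize_nn(feat, size):
    """Nearest-neighbour resize of an (H, W, C) map to (size, size, C)."""
    h, w, _ = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[rows][:, cols]

def multi_channel_fusion(feats):
    """Upsample every map to the first (largest) map's size, concatenate
    along the channel axis, apply a toy fusion gate, then split per layer
    and restore the original sizes. The gate is illustrative only."""
    target = feats[0].shape[0]
    sizes = [f.shape[0] for f in feats]
    channels = [f.shape[2] for f in feats]
    aligned = [resize_nn(f, target) for f in feats]             # up-sampling
    fused = np.concatenate(aligned, axis=2)                     # connection
    fused = fused * sigmoid(fused.mean(axis=2, keepdims=True))  # fusion gate
    splits = np.split(fused, np.cumsum(channels)[:-1], axis=2)  # splitting
    return [resize_nn(s, sz) for s, sz in zip(splits, sizes)]   # down-sampling

# toy R-Conv-2..4 maps at decreasing resolution and increasing depth
rng = np.random.default_rng(3)
feats = [rng.standard_normal((16, 16, 4)),
         rng.standard_normal((8, 8, 8)),
         rng.standard_normal((4, 4, 16))]
out = multi_channel_fusion(feats)
print([o.shape for o in out])  # [(16, 16, 4), (8, 8, 8), (4, 4, 16)]
```

Each output map keeps its original shape, so the fused maps can be dropped back into the network exactly where the originals were, with each layer now carrying information gated by all the others.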
For step S4, selecting an appropriate way to represent the semantic information of an image is important for the classification task, and the spatial and multi-channel fusion self-attention mechanism addresses this image representation problem. Recognizing the image class is equally important, so the invention provides a width classifier to identify image classes: a flat network called a broad learning system performs the classification task. The whole classification process is divided into two steps, feature mapping and node enhancement.
The input of the width classifier is the synthesized feature F′, and the corresponding width mapping features can be expressed as:
M_i = φ(F′ W_si + β_si), i = 1, 2, ..., n
wherein M_i is the i-th width mapping feature, φ is the activation function, W_si is the width mapping convolution kernel, β_si is a randomly generated bias term, and n is the total number of feature points. Thus, all width mapping features can be denoted as M^n = [M_1, M_2, ..., M_n].
The input of the width feature enhancement layer is the width mapping features, and the obtained m-th group of width enhancement features satisfies the following expression:
Hm=σ(MnWhm+βhm)
wherein H_m is the m-th group of width enhancement features, σ is the activation function, W_hm is the width enhancement convolution kernel, and β_hm is a randomly generated bias term.
The output classification result can be expressed as:
Y = [M_1, M_2, ..., M_n | σ(M^n W_h1 + β_h1), ..., σ(M^n W_hm + β_hm)] W^m
wherein W^m represents the connection weights of the width mapping nodes and the width enhancement nodes.
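The width classifier's flat structure (random width mapping nodes, random enhancement nodes, then a closed-form solve for the connection weights W^m, with no back-propagation) can be sketched as below. Node counts, the regulariser, and the ridge-regression solve are assumptions in the spirit of broad learning systems, not values from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class WidthClassifier:
    """Flat broad-learning-style classifier: randomly weighted mapping
    nodes M_i and enhancement nodes H_m; only the output weights W^m are
    learned, in closed form by ridge regression (no back-propagation)."""

    def __init__(self, n_map=10, n_enh=10, reg=1e-2, seed=0):
        self.n_map, self.n_enh, self.reg = n_map, n_enh, reg
        self.rng = np.random.default_rng(seed)

    def _nodes(self, X):
        M = sigmoid(X @ self.W_s + self.b_s)   # width mapping features M^n
        H = sigmoid(M @ self.W_h + self.b_h)   # width enhancement features
        return np.hstack([M, H])               # [M^n | H_1 .. H_m]

    def fit(self, X, Y):
        d = X.shape[1]
        self.W_s = self.rng.standard_normal((d, self.n_map))
        self.b_s = self.rng.standard_normal(self.n_map)
        self.W_h = self.rng.standard_normal((self.n_map, self.n_enh))
        self.b_h = self.rng.standard_normal(self.n_enh)
        A = self._nodes(X)
        # ridge-regularised least squares for the connection weights W^m
        self.W_m = np.linalg.solve(A.T @ A + self.reg * np.eye(A.shape[1]),
                                   A.T @ Y)
        return self

    def predict(self, X):
        return self._nodes(X) @ self.W_m

# toy two-class problem standing in for flattened synthesized features F'
rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((50, 16)) + 2,
               rng.standard_normal((50, 16)) - 2])
Y = np.vstack([np.tile([1.0, 0.0], (50, 1)),
               np.tile([0.0, 1.0], (50, 1))])
clf = WidthClassifier().fit(X, Y)
acc = (clf.predict(X).argmax(axis=1) == Y.argmax(axis=1)).mean()
print(clf.predict(X).shape)  # (100, 2)
```

Because training reduces to one linear solve over the concatenated node matrix, the fast-training claim follows directly from the structure: the only learned parameters are W^m.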
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (6)
1. A remote sensing image scene classification method based on a space and multi-channel fusion attention mechanism is characterized by comprising the following steps:
step S1: providing a remote sensing image to be classified, preprocessing the remote sensing image and acquiring corresponding mapping characteristics.
Step S2: building a spatial self-attention network, wherein the network comprises a convolutional layer and an excitation layer; and inputting the mapping characteristics into a spatial self-attention network to obtain spatial mapping characteristics.
Step S3: building a multi-channel fusion self-attention network, wherein the network comprises an upper sampling layer, a connecting layer, a fusion layer, a splitting layer, a lower sampling layer and a convolution layer; and inputting the mapping characteristics into a multi-channel fusion self-attention network to obtain the multi-channel fusion characteristics.
Step S4: and synthesizing the spatial mapping feature and the multi-channel fusion feature.
Step S5: building a width classifier, wherein the classifier comprises a width feature mapping layer and a width feature enhancement layer; and inputting the synthesized spatial mapping features and multi-channel fusion features into the width classifier to obtain the classification result.
2. The method according to claim 1, wherein the input of step S1 is an H × W remote sensing image (H and W respectively denote the height and width of the image), and the multi-layer residual mapping features of the image are extracted through a ResNet network, denoted R-Conv-1~4.
3. The method according to claim 1, wherein the input of step S2 is the first-layer residual mapping feature R-Conv-1, and the spatial self-attention weight is obtained by a convolution operation followed by an activation function, expressed as:
S = sigmoid(A_s F) * F
wherein S is the output spatial self-attention mapping feature, sigmoid is the sigmoid activation function, A_s is the convolution kernel of the convolution layer, and F ∈ R^(H×W×C) (C is the feature dimension) is the input residual mapping feature.
4. The method according to claim 1, wherein the step S3 includes:
Step S31: the input is the layer 2~4 residual mapping features extracted from ResNet, denoted R-Conv-2~4; R-Conv-3~4 are upsampled to the same size as R-Conv-2.
Step S32: inputting the aligned mapping feature map into the connection layer, and forming a complementary channel fusion feature through the fusion layer.
Step S33: inputting the channel fusion characteristics into a segmentation layer, segmenting the complementary fusion characteristics, and restoring the complementary fusion characteristics to the original size.
5. The method according to claim 1, wherein the synthesis method in step S4 is as follows:
F′ = F + sigmoid((W_s * F) * (W_c * F))
wherein F′ is the synthesized feature, W_s is the spatial self-attention network weight, W_c is the multi-channel fusion self-attention network weight, and * denotes matrix multiplication.
6. The method according to claim 1, wherein step S5 includes:
S51: the input of the width classifier is the synthesized feature F′, and the corresponding width mapping features can be expressed as:
M_i = φ(F′ W_si + β_si), i = 1, 2, ..., n
wherein M_i is the i-th width mapping feature, φ is the activation function, W_si is the width mapping convolution kernel, β_si is a randomly generated bias term, and n is the total number of feature points. Thus, all width mapping features can be denoted as M^n = [M_1, M_2, ..., M_n].
S52: the input of the width feature enhancement layer is the width mapping features, and the obtained m-th group of width enhancement features satisfies the following expression:
Hm=σ(MnWhm+βhm)
wherein H_m is the m-th group of width enhancement features, σ is the activation function, W_hm is the width enhancement convolution kernel, and β_hm is a randomly generated bias term.
S53: the output classification result can be expressed as:
Y = [M_1, M_2, ..., M_n | σ(M^n W_h1 + β_h1), ..., σ(M^n W_hm + β_hm)] W^m
wherein W^m represents the connection weights of the width mapping nodes and the width enhancement nodes.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011093081.6A | 2020-10-13 | 2020-10-13 | Remote sensing image scene classification method based on space and multi-channel fusion self-attention network |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114359613A | 2022-04-15 |
Family

ID=81089672

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011093081.6A (Pending) | Remote sensing image scene classification method based on space and multi-channel fusion self-attention network | 2020-10-13 | 2020-10-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114359613A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116721301A | 2023-08-10 | 2023-09-08 | 中国地质大学(武汉) | Training method, classifying method, device and storage medium for target scene classifying model |
| CN116721301B | 2023-08-10 | 2023-10-24 | 中国地质大学(武汉) | Training method, classifying method, device and storage medium for target scene classifying model |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |